Abstract

There is much empirical evidence that item-item collaborative filtering works well 
in practice. Motivated to understand this, we provide a framework to design and 
analyze various recommendation algorithms. The setup amounts to online binary 
matrix completion, where at each time a random user requests a recommendation 
and the algorithm chooses an entry to reveal in the user’s row. The goal is to minimize
regret, or equivalently to maximize the number of +1 entries revealed at any
time. We analyze an item-item collaborative filtering algorithm that can achieve 
fundamentally better performance compared to user-user collaborative filtering. 
The algorithm achieves good “cold-start” performance (appropriately defined) by 
quickly making good recommendations to new users about whom there is little 
information. 


1 Introduction 

A natural approach to automated recommendation systems is to use content-specific data: similar
words in two books’ titles suggest that the books are similar, and likewise for users’ ages and geographic
locations. Recommending based on such content-specific data is called content filtering. In contrast, a
technique called collaborative filtering (CF) provides recommendations in a content-agnostic way by
exploiting patterns in general purchase or usage data to determine similarity (the term collaborative
filtering was coined in [1]). For example, if 90% of users agree on two items, a CF algorithm might
recommend the second item to a user who likes the first item.

Virtually all industrial recommendation systems use CF. There are two main paradigms in
neighborhood-based CF: user-user and item-item. In the user-user paradigm, recommendation to
user u is done by finding users similar to u and recommending items liked by these users. In the
item-item paradigm, in contrast, items similar to those liked by the user are found and then
recommended. Empirical evidence shows that the item-item paradigm performs well (cf. [2] and [3]).
In this paper we provide theoretical justification by introducing a model for recommendation and
analyzing the performance of a simple item-item CF algorithm.

1.1 Model 

We consider a system with $N$ users and a collection of items $X$. For each item $i \in X$, user $u$ has
binary preference $L_{u,i}$ equal to $+1$ (like) or $-1$ (dislike). Recommendation systems typically
operate in an online setting, meaning that when a user logs into a virtual store (such as Amazon), a
recommendation must be made immediately. At each discrete time step $t = 1, 2, 3, \dots$ a uniformly
random user $U_t \in \{1, \dots, N\}$ requests a recommendation. The recommendation algorithm selects
an item $I_t$ to recommend from the set of available items $X$, after which $U_t$ gives feedback $L_{U_t, I_t}$.
The recommendation must depend only on previous feedback: $I_t$ is required to be measurable with
respect to the sigma-field generated by the history $(U_1, I_1, L_{U_1, I_1}), \dots, (U_{t-1}, I_{t-1}, L_{U_{t-1}, I_{t-1}})$.

We impose the constraint that the recommendation algorithm may only recommend each item to a 
given user at most once. This captures the situation where users do not want to watch a movie or 
read a book more than once and focuses attention on the ability to recommend new items to users. 
For each recommendation, the algorithm may therefore either recommend an item that has been 
previously recommended to other users (in which case it has some information about the item) or 
recommend a new item from X. 

We are interested in the situation where there are many items, and will assume that $X$ is infinite. For
a given item $i$, the corresponding $i$-th column $L_{\cdot,i} \in \{-1,+1\}^N$ of each user’s preferences is called
the type of item $i$. It is convenient to represent the population of items by a probability measure $\mu$
over $\{-1,+1\}^N$. When the algorithm selects an item that has not yet been recommended, the item’s
type is drawn from this distribution in an i.i.d. manner. Recommending a new item corresponds to
adding a column to the rating matrix, with binary preferences jointly distributed according to $\mu$.

1.2 Performance measure 

As is standard in the online decision-making literature, algorithm performance is measured by regret 
relative to an all-knowing algorithm that makes no bad recommendations. The regret at time $T$ is

$$\mathcal{R}(T) = \frac{1}{N} \sum_{t=1}^{TN} \frac{1}{2}\left(1 - L_{U_t, I_t}\right).$$

Recall that at time $t$ user $U_t \in \{1, \dots, N\}$ desires a recommendation, $I_t$ is the recommended item,
and $L_{u,i}$ is equal to $+1$ (resp. $-1$) if $u$ likes (resp. dislikes) item $i$. The regret $\mathcal{R}(T)$ is the number
of bad recommendations per user after having made an average of $T$ recommendations per user.
Dependence on the algorithm is implicit through $I_t$.
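
To make the performance measure concrete, the following minimal sketch computes the regret from a recorded feedback sequence. The function name `regret` and the list-based input are illustrative assumptions, not part of the paper; the only fact used is that each bad recommendation (feedback $-1$) contributes $1/N$.

```python
from typing import Sequence


def regret(feedback: Sequence[int], num_users: int) -> float:
    """Bad recommendations per user, given feedback L_{U_t, I_t} in time order.

    `feedback` is assumed to hold one entry in {+1, -1} per time step.
    """
    if num_users <= 0:
        raise ValueError("num_users must be positive")
    if any(value not in (+1, -1) for value in feedback):
        raise ValueError("feedback entries must be +1 or -1")
    # (1 - L) / 2 equals 1 for a dislike and 0 for a like.
    bad = sum((1 - value) // 2 for value in feedback)
    return bad / num_users


# Example: N = 2 users, 6 time steps (T = 3 per user on average), 2 dislikes.
example = regret([+1, -1, +1, +1, -1, +1], num_users=2)
```

Here two dislikes spread over two users give one bad recommendation per user.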

1.3 Main results 

We now describe two high-level objectives in designing a recommendation system and the
corresponding guarantees obtained for our proposed algorithm ITEM-ITEM-CF, which is described in
Sections 3 and 4. The results are stated in more detail in Section 5.

1.3.1 Cold-start time 

With no prior information, the algorithm should give reliable recommendations as quickly as
possible. The cold-start time of a recommendation algorithm is defined as

$$T_{\text{cold-start}} = \min\left\{ T + \Gamma \ \text{ s.t. } \ T, \Gamma > 0, \ \mathbb{E}\left[\mathcal{R}(T + \Delta) - \mathcal{R}(T)\right] \le 0.1\Delta, \ \forall \Delta > \Gamma \right\}. \qquad (1)$$

This is the first time after which the slope of the expected regret is bounded by 0.1: after $T_{\text{cold-start}}$
the algorithm makes a bad recommendation to a randomly chosen user with probability at most 0.1.¹

Our results: As described in Section 2, we assume that each user likes at least a $\nu > 0$ fraction of
the items. In Theorem 5.1 we show that algorithm ITEM-ITEM-CF achieves $T_{\text{cold-start}} = \tilde{O}(1/\nu)$²
for $N > N_0(d)$. Note that one must typically randomly sample on the order of $1/\nu$ items to find a single liked
item. Our results show that this amount of time-investment suffices in order to give consistently
good recommendations.

¹The choice 0.1 is arbitrary. We will assume that users like only a small fraction of the items, so the
cold-start time is the minimum time after which the algorithm can recommend significantly better than random.
²Here and throughout, $r = \tilde{O}(x)$ means $r \le C x \log^c x$ for some numerical constants $c, C$.

1.3.2 Improving accuracy 

The algorithm should give increasingly more reliable recommendations as it gains information about
the users and items. This is captured by having sublinear expected regret $\mathbb{E}[\mathcal{R}(T)] = o(T)$.

Our results: Proposition 2.1 shows that without assumptions on the item space, it is impossible for
any online algorithm to achieve sublinear regret for any positive length of time. In this paper we
assume that the item space has doubling dimension (a measure of complexity of the space, defined
and motivated later) bounded by $d$. In Theorem 5.2 we show that after time $T_{\text{cold-start}}$ (until which
we incur linear regret), algorithm ITEM-ITEM-CF achieves sublinear expected regret $\tilde{O}(T^{\frac{d+1}{d+2}})$ up
until a certain time $T_{\max}$. After $T_{\max}$ the expected regret again grows linearly (but with much
smaller slope), and this behavior is shown in Theorem 5.3 to be unavoidable. As will be made
explicit, performance improves with increasing number of users: $T_{\max}$ (and hence the length of the
sublinear time-period) increases with $N$ and the eventual linear slope decreases with $N$, both of
which illustrate the so-called collaborative gain.

Remark 1.1. The mathematical formulation of cold-start time is to the best of our knowledge new. 
The strong guarantee we obtain on cold-start time (independent of doubling dimension d) is distinct 
from and does not follow as an implication of the sublinear regret result (which does depend on d). 

1.4 Related Work 

A latent source model of user types is used by [4] to give performance guarantees for user-user 
collaborative filtering. The assumptions on users and items are closely related since $K$ item types
induce at most $2^K$ user types and vice versa (the $K$ item types liked by a user fully identify the
user’s preferences, and there are at most $2^K$ such choices). Since we study algorithms that cluster
similar items together, in this paper we assume a latent structure of items. We note that unlike the 
standard mixture model with minimum separation between mixture components (as assumed in [4]), 
our setup does not have any such gap condition. In contrast, we allow an effectively arbitrary model, 
and we prove performance guarantees based on a notion of dimensionality of the item space. 

Our model can be thought of as a certain extended multi-armed bandit problem. The papers [5, 6]
use notions of dimensionality similar to the one in this paper. They assume that the set of arms $X$
has some geometry: [5] assumes that the arm space is endowed with a metric, and [6] assumes that
the arms have a dissimilarity function (which is not necessarily a metric). The expected rewards are
then related to this geometry of the arms. In the former work, the difference in expected rewards is
Lipschitz³ in the distance, and in the latter work the dissimilarity function constrains the slope of
the reward around its maxima.

The regret of the algorithms in [5, 6] is $\tilde{O}(T^{\frac{d'+1}{d'+2}})$, where $d'$ is a weaker notion (than the one we
use) of the covering number of $X$, and is closely related to the doubling dimension (which we define
later) in the case of a metric. The regret bound in Theorem 5.2 for the sublinear regime is of the
same form, but two important aspects of our model require a different algorithm and more intricate
arguments: (i) in our case no repeat recommendations (i.e. pulling the same arm) can be made to
the same user, and (ii) we do not have an oracle for distances between users and items, and instead
we must estimate distances by making carefully chosen exploratory recommendations.

Aside from these differences, the nature of the collaborative filtering problem leads to additional
novelty relative to existing work on multi-armed bandits. First, we formalize the cold-start problem
and prove strong guarantees in this regard. Second, all of our bounds are in terms of system
parameters. This makes it possible, for example, to see the role of the number of users $N$ as an
important resource allowing for collaboration.

³That is, $|\mathbb{E}[r_i] - \mathbb{E}[r_j]| \le C \cdot d(i,j)$, where $d(i,j)$ is the distance between arms $i$ and $j$, and $C$ is a constant.

The works of [7] and [8] on online learning and matrix completion are also relevant. In their case,
however, the matrix entries to be predicted are not chosen by the algorithm and hence there is no
explore-exploit trade-off. The paper [9] considers collaborative filtering under a mixture model in
the offline setting, and they make separation assumptions between the item types (called genres in
their paper). The work [10] considers a setting similar to ours (but with a finite number of user and
item types) and proves certain guarantees on a moving horizon approximation rather than the
cumulative anytime regret. The paper [11] proves asymptotic consistency guarantees on estimating the
ratings of unrecommended items. The recent paper [12] considers a different model in which
repeat recommendations are also not allowed, but they make recommendations by exploiting existing
information about users’ interests.

It is possible that using the similarities between users, and not just between items as we do, is also 
useful. This has been studied theoretically in the user-user collaborative filtering framework in [4], 
via bandits in a wide variety of settings (for instance [13, 14, 15]), with focus on benefits to the 
cold-start problem [16, 17], and in practice (cf. [18, 19]). In this paper, in order to capture the power
of purely item-item collaborative filtering, we intentionally avoid using any user-user similarities. 

2 Structure in Data 

The main intuition behind all variants of collaborative filtering is that users and items can typically 
be clustered in a meaningful way even ignoring context specific data. Items, for example, can often 
be grouped into a few different types that tend to be liked by the same users. 

2.1 Need for Structure 

As discussed, a good recommendation algorithm suggests items to users that are liked but have not 
been recommended to them before. In order to motivate the need for assumptions on the item space, 
we begin by stating the intuitive result that in the worst case when $\mu$ has little structure, no online
algorithm can do better than recommending random items.

Proposition 2.1 (Lower Bound). Let $\mu$ be the uniform distribution over $\{-1,+1\}^N$. Then for all
$T \ge 1$, the expected regret of any online recommendation algorithm is lower bounded as $\mathbb{E}[\mathcal{R}(T)] \ge T/2$.
Conversely, recommending a random item at each time step achieves $\mathbb{E}[\mathcal{R}(T)] = T/2$.

Proposition 2.1 states that no online algorithm can have sublinear regret for any period of time unless
some structural assumptions are made. Hence, to have any collaborative gain we need to capture
the fact that items tend to come in clusters of similar items. We make two assumptions.

(A1) The distribution $\mu$ over the item space has doubling dimension at most $d$ for a given $d > 0$.

(A2) Each user likes a random item drawn from $\mu$ with probability between $\nu$ and $2\nu$, and each
item is liked by a fraction between $\nu$ and $2\nu$ of the users, for a given $\nu \in (0, 1/4)$.

Assumption A1 captures structure in the item space through the notion of doubling dimension,
defined and motivated in Section 2.2. Assumption A2 is made to avoid the extreme situations where
almost no items are liked (in which case good recommendations are impossible) or most items are
liked (in which case the regret benchmark becomes meaningless).
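
The unstructured case of Proposition 2.1 can be illustrated with a short simulation. It is a sketch under the assumption that, for the uniform distribution on the Hamming cube, every (user, fresh item) preference is an independent fair coin flip. The function name `simulate_random_recommender` and its parameters are hypothetical.

```python
import random


def simulate_random_recommender(num_users: int, horizon: int, seed: int = 0) -> float:
    """Regret of recommending a fresh random item at every step when item
    types are uniform on the Hamming cube (each preference is a fair coin)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(horizon * num_users):   # T * N time steps in total
        _user = rng.randrange(num_users)   # uniformly random requesting user
        liked = rng.random() < 0.5         # fresh item is liked with probability 1/2
        if not liked:
            bad += 1
    return bad / num_users                 # bad recommendations per user


simulated = simulate_random_recommender(num_users=50, horizon=200)
```

With horizon $T = 200$ the returned value should be close to $T/2 = 100$, in line with the proposition.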

2.2 Item Types and Doubling Dimension 

We endow the $N$-dimensional Hamming cube $\{-1,+1\}^N$ with the following normalized $\ell_1$ metric:
for any two item types $x, y \in \{-1,+1\}^N$, define their distance

$$\gamma_{x,y} = \frac{1}{N} \sum_{u=1}^{N} \frac{1}{2}\,|x_u - y_u|.$$

When we write $\gamma_{ij}$ for items $i, j$, we mean the distance between their types, which is the fraction of
users that disagree on them.
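
As a small illustration of this distance, the sketch below computes the fraction of users who disagree on two items. The name `type_distance` and the tuple or list representation of item types are assumptions made for the example.

```python
from typing import Sequence


def type_distance(x: Sequence[int], y: Sequence[int]) -> float:
    """Normalized l1 distance between two item types with entries in {-1, +1}:
    the fraction of users whose preferences for the two items differ."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("item types must be non-empty and of equal length")
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)


# Four users; users 2 and 4 disagree on the two items.
distance = type_distance([+1, +1, -1, -1], [+1, -1, -1, +1])
```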

Let $B(x, r) = \{y \in \{-1,+1\}^N : \gamma_{x,y} \le r\}$ be the ball of radius $r$ centered at $x$.


Definition 2.1 (Doubling Dimension). The doubling dimension of a measure $\mu$ on $\{-1,+1\}^N$ is
the least $d$ such that for each $x \in \{-1,+1\}^N$ with $\mu(x) > 0$ we have

$$\sup_{r > 0} \frac{\mu(B(x, 2r))}{\mu(B(x, r))} \le 2^d.$$

A measure with finite doubling dimension is called a doubling measure.


Measures of low doubling dimension capture the observed clustering phenomenon⁴. It follows
directly from the definition, for instance, that a small doubling dimension ensures that the balls around
any item type must have a significant mass. In particular, any item type $x \in \{-1,+1\}^N$ with $\mu(x) > 0$
has $\mu(B(x, r)) \ge r^d$. Appendix D contains some examples of measures with low doubling
dimension, and the reader is directed there to gain more intuition for the concept. Appendix D also contains
experiments indicating that the doubling dimension is often small in practice, and describes in more
detail why it is an assumption strictly more general than the ones made in [4].
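
For a measure with finite support, the doubling dimension can be computed by brute force. The sketch below assumes the standard ratio form of the definition (the largest value of $\mu(B(x,2r))/\mu(B(x,r))$ equals $2^d$) and closed balls. The names `doubling_dimension`, `ball_mass` and `dist` are illustrative. Because ball masses change only at radii equal to a pairwise distance or half of one, checking those radii suffices.

```python
import math
from typing import Dict, Tuple

ItemType = Tuple[int, ...]


def dist(x: ItemType, y: ItemType) -> float:
    """Fraction of coordinates (users) on which two item types differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def ball_mass(mu: Dict[ItemType, float], x: ItemType, r: float) -> float:
    """Mass of the closed ball of radius r around x."""
    return sum(w for y, w in mu.items() if dist(x, y) <= r)


def doubling_dimension(mu: Dict[ItemType, float]) -> float:
    """log2 of the largest ratio mu(B(x, 2r)) / mu(B(x, r)) over the support."""
    support = [x for x, w in mu.items() if w > 0]
    worst = 1.0  # for very small r both balls contain only x itself
    for x in support:
        radii = set()
        for y in support:
            g = dist(x, y)
            if g > 0:
                radii.update((g, g / 2))  # radii where numerator or denominator jumps
        for r in radii:
            worst = max(worst, ball_mass(mu, x, 2 * r) / ball_mass(mu, x, r))
    return math.log2(worst)


# Two opposite item types with equal mass: doubling a ball can at most double its mass.
two_clusters = {(+1, +1, +1, +1): 0.5, (-1, -1, -1, -1): 0.5}
d_two = doubling_dimension(two_clusters)
d_one = doubling_dimension({(+1, -1): 1.0})
```

In this toy example a single item type has dimension 0 and two equally likely opposite types have dimension 1.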


3 Item-item collaborative filtering algorithm 


In this section we describe our algorithm ITEM-ITEM-CF. The algorithm carries out a certain
procedure over increasingly longer epochs (blocks of time), where the epoch index is denoted by $\tau \ge 1$.
In each epoch the algorithm carefully balances Explore and Exploit steps.

In the Explore steps of epoch $\tau$, a partition $\{P_k^{(\tau+1)}\}$ of a set of items is created for use in the
subsequent epoch. Each epoch has a target precision $\varepsilon_\tau$ (specified below) such that if two items $i$
and $j$ are in the same block then usually $\gamma_{ij} < \varepsilon_{\tau+1}$.

In the Exploit steps of epoch $\tau$, the partition $\{P_k^{(\tau)}\}$ created in the previous epoch is used for
recommendation. Exploit recommendations to a user $u$ are made as follows: $u$ samples a random
item $i$ from a random block $P_k^{(\tau)}$, and if $u$ likes $i$ ($L_{u,i} = +1$) then the rest of $P_k^{(\tau)}$ is recommended
to $u$ in subsequent Exploit steps. After all items in $P_k^{(\tau)}$ have been recommended to $u$, the user
repeats the process by sampling random items in random blocks until liking some item $j$ in $P_{k'}^{(\tau)}$,
upon which the rest of $P_{k'}^{(\tau)}$ is recommended.
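
The Exploit procedure for a single user can be sketched as a generator. This is an illustration only: the name `exploit_recommendations`, the callable `likes`, and the choice to sample uniformly among blocks that still contain unrecommended items are assumptions not fixed by the description above.

```python
import random
from typing import Callable, Iterator, List


def exploit_recommendations(
    blocks: List[List[int]],
    likes: Callable[[int], bool],
    rng: random.Random,
) -> Iterator[int]:
    """Yield the items recommended to one user, each item at most once."""
    remaining = [list(block) for block in blocks]  # copy; per-user bookkeeping
    while any(remaining):
        # Probe: a random item from a random block that still has unseen items.
        k = rng.choice([idx for idx, items in enumerate(remaining) if items])
        probe = remaining[k].pop(rng.randrange(len(remaining[k])))
        yield probe
        if likes(probe):
            # The probe was liked: recommend the rest of its block.
            while remaining[k]:
                yield remaining[k].pop()


rng = random.Random(0)
blocks = [[1, 2, 3], [4, 5, 6]]
sequence = list(exploit_recommendations(blocks, likes=lambda item: item <= 3, rng=rng))
```

In this toy run the user likes exactly the first block, so once any of its items is probed the remaining two follow immediately.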

In the first epoch there is no possibility of exploiting a partition created in a previous epoch, so the 
algorithm begins with a purely exploratory “cold-start” period. The pseudo-code of the algorithm is 
as follows.


⁴The above definition is a natural adaptation to probability measures on metric spaces of the notion of
doubling dimension for metric spaces (cf. [20], [21] and [22]). As noted in, for instance, [22], this is equivalent
to enforcing that $\mu(B(x, \alpha r)) \le \alpha^d \cdot \mu(B(x, r))$ for any $r > 0$ and any $x \in \{-1,+1\}^N$ with $\mu(x) > 0$. For
Euclidean spaces, the doubling dimension coincides with the ambient dimension, which reinforces the intuition
that metric spaces of low doubling dimension have properties of low dimensional Euclidean spaces.

ITEM-ITEM-CF($N$)

1 Algorithm parameters: 

£n = • 630(2d + ll)(d + ,C=r^-^ 

£t = max £jv) ' C, for r > 1 (target accuracy for epoch) 

Mr = - - - - ^ for r > 1 (number of items introduced in epoch) 

Dr = I Af,-, for T > 1 (duration of epoch) 

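2 Cold-Start:

    $\{P_k^{(1)}\}$ = MAKE-PARTITION($M_1, \varepsilon_1, \varepsilon_1$)

3 Subsequent epochs:
    for $\tau \ge 1$

        do for $t = 1$ to $N \cdot D_\tau$

            do $u_t$ = random user

               w.p. $1 - \varepsilon_\tau$: exploit $\{P_k^{(\tau)}\}$ to recommend an item to $u_t$

               w.p. $\varepsilon_\tau$: $u_t$ explores to help construct partition $\{P_k^{(\tau+1)}\}$ (see Section 4)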

4 Explore: making a partition 

Recall that during epoch $\tau$ the goal of the explore recommendations is to create a partition $\{P_k^{(\tau+1)}\}$
of items such that whenever $i, j \in P_k^{(\tau+1)}$ then $\gamma_{ij} < \varepsilon_{\tau+1}$. We later prove that this can be
done by executing the routine MAKE-PARTITION($M_{\tau+1}, \varepsilon_{\tau+1}, \varepsilon_{\tau+1}$) described below, which at
any point makes recommendations to a randomly chosen user. Hence, given the random user
requesting the recommendation, ITEM-ITEM-CF provides explore recommendations in whatever order
MAKE-PARTITION would have recommended (had it been run sequentially)⁵.

Definition 4.1 ($\varepsilon$-net). For any $\varepsilon > 0$, a collection of items $C$ is called an $\varepsilon$-net of the item space
represented by distribution $\mu$ on $\{-1,+1\}^N$ if (a) for any pair $i, j \in C$, we have $\gamma_{ij} > \varepsilon/2$, and (b)
for any item of type $i$ with $\mu(i) > 0$, there exists $j \in C$ so that $\gamma_{ij} < \varepsilon$.

MAKE-PARTITION first finds a net $C$ for the item space (using the subroutine GET-NET described
later). To each item in the net there is associated a block in a partition. $M$ randomly sampled items
are assigned to the blocks as follows: for each sampled item $j$, an item $i \in C$ is found that is similar
to $j$, and $j$ is assigned to the partition block $P_i$ (if there is more than one item $i$ similar to $j$, the
algorithm chooses among the relevant blocks at random). Finally, the algorithm breaks up large
blocks into blocks of size on the order of $1/\varepsilon$. This guarantees that there will be many blocks in the
partition, which turns out to be important in Theorem 5.1 showing brief cold-start time⁶.

MAKE-PARTITION($M, \varepsilon, \delta$)

1 $C$ = GET-NET($\varepsilon/2, \delta/2$)

2 $\mathcal{M}$ = $M$ randomly drawn items from item space

3 for each $i \in C$, let $P_i = \emptyset$

4 for each $j \in \mathcal{M}$, let $S_j = \{i \in C \mid$ SIMILAR($i, j, 0.6\varepsilon$, ) returns true$\}$

5     if $|S_j| > 0$ then $P_i = P_i \cup \{j\}$, for $i$ chosen u.a.r. from $S_j$

6 for each $i$, if $|P_i| > 1/\varepsilon$, then partition $P_i$ into blocks of size at least $\frac{1}{2\varepsilon}$ and no more than $1/\varepsilon$

7 return $\{P_i\}$
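
The assignment and splitting steps of MAKE-PARTITION can be sketched as follows. This is not the paper's implementation: the net and the similarity test are passed in as arguments (`net`, `is_similar`), empty blocks are dropped, and large blocks are cut into the fewest near-equal chunks of size at most $1/\varepsilon$, which is one possible way to meet the size constraint in step 6.

```python
import math
import random
from typing import Callable, Dict, Hashable, List, Sequence

Item = Hashable


def make_partition(
    net: Sequence[Item],
    sampled: Sequence[Item],
    is_similar: Callable[[Item, Item], bool],
    eps: float,
    rng: random.Random,
) -> List[List[Item]]:
    """Assign sampled items to blocks indexed by net items, then split large blocks."""
    blocks: Dict[Item, List[Item]] = {i: [] for i in net}
    for j in sampled:
        candidates = [i for i in net if is_similar(i, j)]
        if candidates:  # items similar to no net item stay unassigned
            blocks[rng.choice(candidates)].append(j)

    max_size = max(1, math.floor(1 / eps))
    result: List[List[Item]] = []
    for items in blocks.values():
        if not items:
            continue  # empty blocks are dropped in this sketch
        n_chunks = math.ceil(len(items) / max_size)
        base, extra = divmod(len(items), n_chunks)
        start = 0
        for c in range(n_chunks):
            size = base + (1 if c < extra else 0)
            result.append(items[start:start + size])
            start += size
    return result


# Toy example: items are integers, "similar" means closer than 5.
partition = make_partition(
    net=[0, 10],
    sampled=[1, 2, 3, 4, 5, 11],
    is_similar=lambda i, j: abs(i - j) < 5,
    eps=0.5,
    rng=random.Random(0),
)
```

In the toy run item 5 is similar to neither net item and is left out, while the block around 0 is split into two blocks of size 2.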

⁵For instance, suppose that time $t$ is the first in some epoch $\tau$. If times $t + 5$ and $t + 30$
are the first two explore recommendations of the epoch, then for those two recommendations the algorithm
makes whatever the first two recommendations would have been in MAKE-PARTITION. If the execution of
MAKE-PARTITION has finished, the algorithm resorts to an exploit recommendation instead.

⁶It is crucial that blocks in the partition are not too small because we would like the reward for exploration
to be large when a user finds a likable item (reward in the sense of many new items to recommend). Although
the algorithm does not explicitly ensure that blocks are not too small (as it did in ensuring the blocks are not
too large), this comes as a byproduct of a property proven in Proposition D.2, which shows that there are not many
items in the net close to any given item $j$.

The subroutine SIMILAR is used in MAKE-PARTITION; it determines whether most users have the
same preference (like or dislike) for two given items $i$ and $j$. This is accomplished by sampling many
random users and counting the number of disagreements on the two items.


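SIMILAR($i, j, \varepsilon, \delta$)

1 $q_{\varepsilon,\delta} = \left\lceil \frac{630(d+1)}{\varepsilon} \ln\left(\frac{1}{\delta}\right) \right\rceil$

2 for $n = 1$ to $q_{\varepsilon,\delta}$

3     do sample a uniformly random user $u_n$

4        let $X_n = \mathbb{1}\{L_{u_n,i} \ne L_{u_n,j}\}$

5 if $\frac{1}{q_{\varepsilon,\delta}} \sum_{n=1}^{q_{\varepsilon,\delta}} X_n > 0.9\varepsilon$ then

6     return False

7 return True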
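
A runnable sketch of SIMILAR is given below. It assumes the sample size $q_{\varepsilon,\delta} = \lceil 630(d+1)\varepsilon^{-1}\ln(1/\delta) \rceil$ used in the proof of Lemma A.1, and it reads preferences through a hypothetical `rating(user, item)` oracle. In the actual algorithm each lookup is a recommendation made to the sampled user.

```python
import math
import random
from typing import Callable


def sample_size(eps: float, delta: float, d: float) -> int:
    """Number of users sampled by SIMILAR (assumed form, see lead-in)."""
    return math.ceil(630 * (d + 1) / eps * math.log(1 / delta))


def similar(
    rating: Callable[[int, int], int],  # rating(user, item) in {+1, -1}
    i: int,
    j: int,
    eps: float,
    delta: float,
    d: float,
    num_users: int,
    rng: random.Random,
) -> bool:
    """Return True unless the sampled disagreement rate exceeds 0.9 * eps."""
    q = sample_size(eps, delta, d)
    disagreements = 0
    for _ in range(q):
        u = rng.randrange(num_users)  # uniformly random user
        if rating(u, i) != rating(u, j):
            disagreements += 1
    return disagreements / q <= 0.9 * eps


# Toy preferences for 4 users: items 0 and 1 are identical, item 2 is their opposite.
table = {0: [1, -1, 1, -1], 1: [1, -1, 1, -1], 2: [-1, 1, -1, 1]}
oracle = lambda user, item: table[item][user]
same = similar(oracle, 0, 1, eps=0.5, delta=0.1, d=1, num_users=4, rng=random.Random(0))
opposite = similar(oracle, 0, 2, eps=0.5, delta=0.1, d=1, num_users=4, rng=random.Random(0))
```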

The subroutine GET-NET below is a natural greedy procedure for constructing an $\varepsilon$-net. Given
parameters $\varepsilon$ and $\delta$, it finds a set of items $C$ that is an $\varepsilon$-net for $\mu$ with probability at least $1 - \delta$
(proven in the appendix). It does so by keeping a set of items $C$ and whenever it samples an item $i$
that currently has no similar item in $C$, it adds $i$ to $C$.

GET-NET($\varepsilon, \delta$)

1 $C = \emptyset$, COUNT = 0

2 MAX-SIZE = (4/e)'^, MAX-WAIT = In ( 2 max-size ^^ ^ = ,5/ ^4 . MAX-WAIT • MAX-SIZE^) 

3 while COUNT < MAX-WAIT and $|C|$ < MAX-SIZE

4     do draw item $i$ from $\mu$

5        if SIMILAR($i, j, \varepsilon, \delta'$) for any $j \in C$ then

6            COUNT = COUNT + 1

7        else $C = C \cup \{i\}$, COUNT = 0

8 return $C$
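
The greedy loop of GET-NET translates almost line by line into code. In the sketch below the item source `draw_item` and the similarity test `is_similar` are injected as callables, and `max_size` and `max_wait` are plain arguments rather than the specific values set in step 2. All of these names are assumptions for illustration.

```python
import itertools
from typing import Callable, List, TypeVar

Item = TypeVar("Item")


def get_net(
    draw_item: Callable[[], Item],
    is_similar: Callable[[Item, Item], bool],
    max_size: int,
    max_wait: int,
) -> List[Item]:
    """Greedily collect items that are not similar to anything collected so far."""
    net: List[Item] = []
    count = 0  # consecutive draws that were similar to some net item
    while count < max_wait and len(net) < max_size:
        i = draw_item()
        if any(is_similar(i, j) for j in net):
            count += 1
        else:
            net.append(i)
            count = 0
    return net


# Toy example: items are points on a line, cycled deterministically.
stream = itertools.cycle([0.0, 0.1, 0.5, 0.55, 0.9])
net = get_net(
    draw_item=lambda: next(stream),
    is_similar=lambda a, b: abs(a - b) < 0.2,
    max_size=10,
    max_wait=5,
)
```

The loop stops once five consecutive draws are all similar to an existing net item, leaving one representative per cluster of points.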

5 Main Results 

5.1 Cold-start time and regret bound

In Section 3 we described how ITEM-ITEM-CF starts recommending items to a user as soon as it
finds one item that the user likes. This leads to a short cold-start time.

Theorem 5.1 (Cold-Start Performance). Suppose assumptions A1 and A2 are satisfied. Then the
algorithm ITEM-ITEM-CF has cold-start time $T_{\text{cold-start}}$ =

Hence, the algorithm ITEM-ITEM-CF has cold-start time $\tilde{O}(1/\nu)$ for $N$ sufficiently large. This
differs from that of [4] for the user-user paradigm, where the cold-start time increases with user space
complexity and the effect is not counteracted by having more users present.

The next result shows that after the cold-start period and until a time Tmax, the expected regret is 
sublinear. 

Theorem 5.2 (Regret Upper Bound of ITEM-ITEM-CF). Suppose assumptions A1 and A2 are 
satisfied. Then ITEM-ITEM-CF achieves expected regret 



max 


(2)

The reader is directed to the proof in the appendix for the exact constants. Also note that $T_{\max}$
increases with $N$ and the asymptotic slope $\varepsilon_N$ decreases as a function of $N$, both of which illustrate
the so-called collaborative gain. Finally, the regret bound in Theorem 5.2 has an asymptotic linear
regime. The next result shows that with a finite number of users such linear regret is unavoidable.

Theorem 5.3 (Asymptotic linear regret is unavoidable). Consider an item space $\mu$ satisfying
assumptions A1 and A2. Then any online algorithm must have expected asymptotic regret
$\mathbb{E}[\mathcal{R}(T)] \ge C(\nu, N) \cdot T$, where $C(\nu, N) = (1 - 2\nu)/N$.


5.2 Comparison with user-user CF 


In this section we will contrast the cold-start performance of user-user collaborative filtering to that
of our item-item algorithm. In particular, we give a heuristic argument showing that the cold-start
time of user-user collaborative filtering grows with the complexity of the user space. This is in
contrast to our Theorem 5.1, where for any doubling dimension of the item space, if there are
sufficiently many users then the cold-start time is independent of system complexity.

We consider a simple scenario with $K$ user clusters. First, let $\gamma_{u,v}$ denote the probability that users
$u$ and $v$ disagree on an item randomly drawn from the item space. We have $K$ equally sized clusters of
users, such that $\gamma_{u,v} = 0$ for users $u, v$ in the same cluster, and $\gamma_{u,v} \in (0.1\nu, 0.2\nu)$ for users $u, v$ in
different clusters.

Consider now a given user $u$. A user-user algorithm seeks to find another user $v$ who is similar to
$u$, so that the items liked by $v$ can be recommended to $u$. In order to recommend with at most (say)
0.1 probability of error, the similar user $v$ should have distance at most $0.1\nu$. The extra factor $\nu$
is present because inference can only effectively be made from the $\nu$ fraction of liked items.

Concretely, we sample a random user $v$, and attempt to decide if it is from the same cluster as $u$.
Suppose $u$ and $v$ have rated $q$ items in common. The problem then reduces to a classical hypothesis
test: after observing $q$ items in common from two users, determine whether or not they are from
the same cluster. The goal is to understand the minimal value of $q$ needed so that the above
procedure works with probability at least 1/2.

We consider the maximum a posteriori (MAP) rule for deciding that $v$ is from $u$’s cluster. If $u$ and $v$ disagree
on any single item, then they cannot be from the same cluster. Conversely, if $u$ and $v$ agree on all $q$
sampled items, the MAP rule declares $v$ to be from $u$’s cluster only if




1 

K ■ 


This means that if $q$ is too small, we will never declare $v$ to be from $u$’s cluster and therefore will be
unable to make recommendations. Rearranging gives $q \ge \Omega(\log(K)/\nu)$.

Hence, an algorithm based on user similarity needs at least $T = \Omega(\log(K)/\nu)$ steps simply to
determine if two users are similar to each other, a prerequisite to making good recommendations. In
contrast, we have shown that ITEM-ITEM-CF achieves cold-start time $\tilde{O}(1/\nu)$, which in particular
does not increase with the complexity of the item space.
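
The scaling $\log(K)/\nu$ can be checked numerically with a small sketch. It rests on assumptions that are ours rather than the paper's exact statement: a prior of $1/K$ that $v$ lies in $u$'s cluster, and users from different clusters disagreeing on each common item independently with probability `gamma`. Under these assumptions, agreeing on all $q$ items leads the MAP rule to declare "same cluster" once $(1-\gamma)^q \le 1/(K-1)$. The function name `min_common_items` is hypothetical.

```python
import math


def min_common_items(num_clusters: int, gamma: float) -> int:
    """Smallest q with (1 - gamma)**q <= 1 / (num_clusters - 1)."""
    if num_clusters < 2 or not 0 < gamma < 1:
        raise ValueError("need at least 2 clusters and gamma in (0, 1)")
    # log1p(-gamma) = log(1 - gamma), computed accurately for small gamma.
    return math.ceil(math.log(num_clusters - 1) / -math.log1p(-gamma))


q_far = min_common_items(num_clusters=101, gamma=0.5)
q_close = min_common_items(num_clusters=101, gamma=0.01)
```

As the disagreement probability shrinks (it is proportional to $\nu$ in the scenario above), the required number of common items grows roughly like $\log(K)/\gamma$.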

This contrast between cold-start times highlights the asymmetry between item-item and user-user 
collaborative filtering. It is much faster to compare two items than it is to compare two users: it 
takes a long time to make many recommendations to two particular users, but comparing two items 
can be done in parallel by sampling different users. 


6 Discussion 

This paper analyzes a collaborative filtering algorithm based on item similarity, and proves
guarantees on its regret. Our algorithm exploits structure only in the item space. It would be desirable to
have a matching lower bound, in the spirit of the lower bounds for multi-armed bandits in metric spaces
shown in [5] and [6]. Furthermore, many practitioners use a hybrid of user-user and item-item
paradigms ([23] and [24]), and formally analyzing such algorithms is an open problem.

Finally, the main challenge of the cold-start problem is that initially we do not have any
information about item-item similarities. In practice, however, some similarity can be inferred via
content-specific information. For instance, two books with similar words in the title can have a prior for
having a higher similarity than books with no similar words in the title. In practice such hybrid
content/collaborative filtering algorithms have had good performance [25]. Formally analyzing such
hybrid algorithms has not been done and can shed light onto how to best combine content
information with the collaborative filtering information.

A Correctness of Explore


This section of the Appendix establishes correctness of the explore procedure as well as some of
its properties that will be used to establish the main result of the paper. Concretely, we will
prove that with high probability the procedure MAKE-PARTITION produces a partition of similar
items during each epoch. To that end, in Appendix A.1, we prove that SIMILAR succeeds in deciding
whether two items are close to each other. In Appendix A.2 we prove that the procedure GET-NET
succeeds in finding a set of items that is an $\varepsilon$-net for $\mu$. We then put all the pieces together and prove
that MAKE-PARTITION, the routine that the explore recommendations aim to complete,
succeeds in creating a partition of similar items. Finally, in Appendix A.3 we prove that with high
probability during any given epoch there will be enough explore recommendations.


A.1 Guarantees for SIMILAR


The procedure SIMILAR is used throughout GET-NET and MAKE-PARTITION. It tests whether two
items are approximately $\varepsilon$-close to each other.

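Lemma A.1. Let $i$ and $j$ be arbitrary items, let $\delta, \varepsilon \in (0, 1)$, and let $S_{ij}$ be the event that
SIMILAR($i, j, \varepsilon, \delta$) returns TRUE. Then we have that

(i) if $\gamma_{ij} \le 0.8\varepsilon$, then $P(S_{ij}) \ge 1 - \delta$, and

(ii) if $\gamma_{ij} \in [k\varepsilon, (k+1)\varepsilon)$ where $k \in \{1, \dots, \lfloor 1/\varepsilon \rfloor\}$, then $P(S_{ij}) \le \frac{\delta}{4k^2 (4k)^d}$.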


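Proof. Let us begin with case (i), where $\gamma_{ij} \le 0.8\varepsilon$. Let $A_n$ be the event that the $n$-th
randomly chosen user disagrees on $i$ and $j$ (i.e. that user likes exactly one of $i$ and $j$), and note that
$\mathbb{E}\big(\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n}\big) \le 0.8\varepsilon q_{\varepsilon,\delta}$. Then, by the Chernoff bound (stated in Theorem E.1), we get

$$P\left(S_{ij}^c \mid \gamma_{ij} \le 0.8\varepsilon\right) = P\left(\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n} > 0.9\varepsilon q_{\varepsilon,\delta}\right) \le P\left(\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n} > (1 + .1)\, \mathbb{E}\left[\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n}\right]\right) \le \exp\left(-\frac{.1^2}{2 + .1}\, \mathbb{E}\left[\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n}\right]\right). \qquad (3)$$

Now, since $q_{\varepsilon,\delta} = \left\lceil \frac{630(d+1)}{\varepsilon} \ln\left(\frac{1}{\delta}\right) \right\rceil \ge 210 \cdot \frac{1}{\varepsilon} \ln\left(\frac{1}{\delta}\right)$, we get

$$P\left(S_{ij}^c \mid \gamma_{ij} \le 0.8\varepsilon\right) = P\left(\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n} > 0.9\varepsilon q_{\varepsilon,\delta}\right) \le \delta. \qquad (4)$$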

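Let us now consider case (ii), where $\gamma_{ij} \in [k\varepsilon, (k+1)\varepsilon)$. As before, let $A_n$ be the event that the
$n$-th randomly chosen user disagrees on $i$ and $j$. Then, since $k \ge 1$ and again by the Chernoff bound,
we get

$$P\left(S_{ij} \mid \gamma_{ij} \in [k\varepsilon, (k+1)\varepsilon)\right) = P\left(\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n} \le 0.9\varepsilon q_{\varepsilon,\delta}\right) \le P\left(\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n} \le (1 - .1)\, \mathbb{E}\left[\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n}\right]\right) \le \exp\left(-\frac{.1^2}{2 + .1}\, \mathbb{E}\left[\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n}\right]\right). \qquad (5)$$

In order to get the desired conditions we need to show that

$$\exp\left(-\frac{1}{210}\, k\varepsilon q_{\varepsilon,\delta}\right) \le \frac{\delta}{4k^2 (4k)^d}. \qquad (6)$$

By taking natural log of both sides we get

$$\frac{1}{210}\, k\varepsilon q_{\varepsilon,\delta} \ge d \ln(4k) + \ln\left(\frac{4k^2}{\delta}\right), \qquad (7)$$

where the right-hand side is in turn at most $(d + 1) \ln\left(\frac{16k^2}{\delta}\right)$. Hence, it suffices to show
that

$$\frac{1}{210}\, k\varepsilon q_{\varepsilon,\delta} \ge (d + 1) \ln\left(\frac{16k^2}{\delta}\right). \qquad (8)$$

However, since we have $q_{\varepsilon,\delta} \ge \frac{630(d+1)}{\varepsilon} \ln\left(\frac{1}{\delta}\right)$, we get

$$\frac{1}{210}\, k\varepsilon q_{\varepsilon,\delta} \ge \frac{1}{210} \cdot \frac{630(d+1)}{\varepsilon} \ln\left(\frac{1}{\delta}\right) k\varepsilon = 3(d+1)\, k \ln\left(\frac{1}{\delta}\right). \qquad (9)$$

Now since $3k \ge \ln(16k^2)$ for $k \ge 1$, we conclude that

$$P\left(S_{ij} \mid \gamma_{ij} \in [k\varepsilon, (k+1)\varepsilon)\right) = P\left(\sum_{n=1}^{q_{\varepsilon,\delta}} 1_{A_n} \le 0.9\varepsilon q_{\varepsilon,\delta}\right) \le \frac{\delta}{4k^2 (4k)^d}, \qquad (10)$$

as desired. ■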
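
The constants in this proof come from two short computations, spelled out here as a worked check. It assumes the multiplicative Chernoff bound is applied with relative deviation $0.1$ and exponent factor $0.1^2/(2+0.1)$, and that the sample size satisfies $q_{\varepsilon,\delta} \ge \frac{630(d+1)}{\varepsilon}\ln(1/\delta)$.

```latex
\frac{(0.1)^2}{2 + 0.1} = \frac{0.01}{2.1} = \frac{1}{210},
\qquad
\frac{1}{210}\, k\varepsilon\, q_{\varepsilon,\delta}
\;\ge\; \frac{630}{210}\,(d+1)\,k\,\ln\!\Big(\frac{1}{\delta}\Big)
\;=\; 3\,(d+1)\,k\,\ln\!\Big(\frac{1}{\delta}\Big).
```

The factor 630 in the sample size is therefore exactly three times the Chernoff constant 210.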

In the Lemma above we showed that given two items $i$ and $j$, the routine SIMILAR can tell that the
items are $\varepsilon$-close when $\gamma_{ij} \le 0.8\varepsilon$, and it can tell that they are not when $\gamma_{ij} > \varepsilon$. Furthermore, the
Lemma states that the probability of a false positive decreases extremely fast as the items get farther
apart. The Lemma below shows that, when one of the items is drawn from $\mu$, SIMILAR still works
and that the false positive rate is small, despite the possibility that it may be much more likely
to draw an item that is far from $i$. In Lemma A.2 below we use the doubling dimension of $\mu$ for the
first time, and in this context the doubling dimension guarantees that SIMILAR (which is a random
projection) preserves relative distances.

Lemma A.2. Let $i$ be an arbitrary item, let $J$ be a randomly drawn item from an item space $\mu$ of
doubling dimension $d$, and let $S_{iJ}$ be the event that SIMILAR($i, J, \varepsilon, \delta$) returns TRUE. Then we have

$$P\left(\gamma_{iJ} > \varepsilon \mid S_{iJ}\right) \le \delta.$$


Proof By Bayes’ rule we get 

P > e I S^J) = 


P (7i,J > Sij) 


P > e, Sij) + P ( 77 J < £, Sij) ’ 


( 11 ) 


where the probability is with the respect to the random choice of J and the random users in SIMILAR. 
Now if 

dP [Sij,^u < e) > (1 - (5) P {Sij, jij > e) (*) 


holds, we get 


P hi.j > £ I Sij) = 


1 


< 


1 I lF’(7i,j<e.-Sij) — 1 I t-S 
^ F( 7 i.J>e.Sij) ^ 


= (5. 


( 12 ) 


Hence, it suffices to show (*). Recall that B {i, r) is the ball of radius r centered at i, and note that 

P ilij < £, Sij) > P [Sij I 7 ij < e/2) p, {B (i, e/2)) , (13) 

and 

rios2(i)i 

¥{l^J>e,Su)= ^{S^J\l^J&[2'^e,2'^+^e))p{B{i,2^+^e)-B{l,2'^e)) (14) 


12 








( 15 ) 


riog 2 (j)i 

< ^ |7.jG [2'=e,2'=+i£))/i(e(i,2'=+ie)). 

k=0 

Let us first lower bound F{Sij,"fij < e/2) /r (S (f, e/2)). Let p = p.{B {i,e/2)). Then by 
Lemma A. 1 we get that 

P {Sij I 7iJ < e/2) (B {i, e/2)) > (1 - 5) p. (16) 

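We will now upper bound $\mathbb{P}(S_{iJ}, \gamma_{iJ} > \epsilon)$. Using the doubling dimension of the item space, which
implies that $\mu(B(i, 2^{k+1}\epsilon)) \le (2^{k+2})^d\, p$, we also get that

$$\mathbb{P}(\gamma_{iJ} > \epsilon, S_{iJ}) \le \sum_{k=0}^{\lceil \log_2(1/\epsilon) \rceil} \mathbb{P}(S_{iJ} \mid \gamma_{iJ} \in [2^k\epsilon, 2^{k+1}\epsilon))\, \mu(B(i, 2^{k+1}\epsilon)) \tag{17}$$

$$\le \sum_{k=0}^{\lceil \log_2(1/\epsilon) \rceil} \mathbb{P}(S_{iJ} \mid \gamma_{iJ} \in [2^k\epsilon, 2^{k+1}\epsilon))\, (2^{k+2})^d\, p.$$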

We now use the second half of Lemma A.l, and arrive at 


P iliJ > £, S^j) < 



(2 


Ai+ 2 \ 


I 6^ 1 6 

P<Pll^^<P2- 

k=0 


We can now check that the sufficient condition (*) is indeed satisfied:

c 

(5P {Sij, ■jij < £)> 5p{l- S)> -p (1 - (5) > (1 - (5) P {S^j, ^ij > e), 

which completes the proof. 


(18) 


(19) 


( 20 ) 


A.2 Making the Partition 

In the previous section we proved that the procedure SIMILAR works well in deciding whether two
items are similar to each other at some desired precision. In this section, we prove that, with
SIMILAR as a building block, we can partition items into blocks of similar items.

We begin by proving that the subroutine GET-NET, used at the beginning of MAKE-PARTITION,
succeeds at producing an $\epsilon$-net of items with high probability.

Lemma A.3. With probability at least $1 - \delta$ the routine GET-NET$(\epsilon, \delta)$ returns an $\epsilon$-net for $\mu$
that contains at most $(4/\epsilon)^d$ items.

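Proof. Let us first settle some notation. Let $C_{final}$ be the set returned by GET-NET$(\epsilon, \delta)$, let $C_r$ be
the set $C$ when it had $r$ items, and let $\mathcal{M}_r$ be the set of random items drawn when $C$ had $r$ items.
Furthermore, denote by $\mathcal{P}$ the event that for each $i, j \in C_{final}$ we have $\gamma_{ij} > \epsilon/2$, and by $\mathcal{C}$ the event
that for each item $i$ there exists a $j \in C_{final}$ such that $\gamma_{ij} \le \epsilon$. Furthermore, let $E_{ij}$ be the event
$\{S_{ij}, \gamma_{ij} > 0.5\epsilon\} \cup \{S_{ij}^c, \gamma_{ij} < 0.8\epsilon\}$, and

$$E = \bigcup_{r=0}^{|C_{final}|-1} \bigcup_{j \in \mathcal{M}_r} \bigcup_{c \in C_r} E_{c,j}.$$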

Intuitively, the event $E$ happens when some call to SIMILAR returned an erroneous answer. We will
show that

(A) $\mathbb{P}(E) \le \delta/2$,

(B) $\mathbb{P}(\mathcal{P}^c \mid E^c) = 0$, and

(C) $\mathbb{P}(\mathcal{C}^c \mid E^c) \le \delta/2$,

which together show that GET-NET$(\epsilon, \delta)$ returns an $\epsilon$-net with probability at least $1 - \delta$.
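Proof of (A): By a union bound we get

$$\mathbb{P}(E) \le \sum_{r=0}^{|C_{final}|-1} \sum_{j \in \mathcal{M}_r} \sum_{c \in C_r} \mathbb{P}(E_{c,j}) \le \sum_{r=0}^{\text{MAX-SIZE}} \sum_{j \in \mathcal{M}_r} \sum_{c \in C_r} \mathbb{P}(E_{c,j}),$$

and since there are at most MAX-WAIT items in $\mathcal{M}_r$ and at most MAX-SIZE items in $C_r$ we get

$$\mathbb{P}(E) \le (\text{MAX-SIZE})^2 \cdot \text{MAX-WAIT} \cdot \mathbb{P}(E_{c,j}) \le \delta/4,$$

where the last inequality follows since Lemma A.1 gives us that

$$\mathbb{P}(E_{c,j}) \le \delta' = \delta / \left(4 \cdot \text{MAX-WAIT} \cdot \text{MAX-SIZE}^2\right).$$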

Proof of (B): Note that if there are two items $i, j \in C$ such that $\gamma_{ij} \le \epsilon/2$, then this must have
happened as a result of some erroneous response of SIMILAR$(i, j, 0.6\epsilon, \delta')$. However, since we are
conditioning on $E^c$ no such erroneous response can occur.

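Proof of (C): Let us consider the two cases $|C_{final}| = \text{MAX-SIZE}$ and $|C_{final}| < \text{MAX-SIZE}$. Then

$$\mathbb{P}(\mathcal{C}^c \mid E^c) \le \mathbb{P}(\mathcal{C}^c \mid E^c, |C_{final}| = \text{MAX-SIZE}) + \mathbb{P}(\mathcal{C}^c \mid E^c, |C_{final}| < \text{MAX-SIZE}),$$

and we will show that

(C1) $\mathbb{P}(\mathcal{C}^c \mid E^c, |C_{final}| = \text{MAX-SIZE}) = 0$, and

(C2) $\mathbb{P}(\mathcal{C}^c \mid E^c, |C_{final}| < \text{MAX-SIZE}) \le \delta/2$,

which together prove (C).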

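Proof of (C1): Note that we are conditioning on $E^c$, which in turn implies that $C_{final}$ is a
packing, so that $\gamma_{ij} > \epsilon/2$ for each pair of distinct $i, j \in C_{final}$. It follows that
$B(i, \epsilon/4) \cap B(j, \epsilon/4) = \emptyset$ for each pair of distinct $i, j \in C_{final}$ as well. Hence

$$\mu\Big(\bigcup_{i \in C_{final}} B(i, \epsilon/2)\Big) \ge \mu\Big(\bigcup_{i \in C_{final}} B(i, \epsilon/4)\Big) = \sum_{i \in C_{final}} \mu(B(i, \epsilon/4)).$$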


Now by the doubling dimension condition we get that p {B (i, e/2)) > and hence can con¬ 

clude that 



and hence for any item there exists an item in Cfinai such that fij < e/2. 


Proof of (C2): Consider now the case in which $|C_{final}| < \text{MAX-SIZE}$. This means that at some
iteration $r \in \{0, \ldots, \text{MAX-SIZE} - 1\}$ of the while loop there existed an item $j$ such that $\gamma_{ij} > \epsilon$
for each $i \in C_r$, but the algorithm nevertheless terminated and returned $C_r$. Let $T_r$ be the event that
the algorithm terminated at round $r$ while there still existed an item $j$ which is not $\epsilon$-close to any
item in $C_r$. Then $\mathcal{C}^c \subseteq \bigcup_r T_r$, and hence

$$\mathbb{P}(\mathcal{C}^c \mid E^c, |C_{final}| < \text{MAX-SIZE}) \le \sum_{r=0}^{\text{MAX-SIZE}-1} \mathbb{P}(T_r \mid E^c),$$

and so it suffices for (C2) to show that $\mathbb{P}(T_r \mid E^c) \le \frac{\delta}{2 \cdot \text{MAX-SIZE}}$.

To show that $\mathbb{P}(T_r \mid E^c) \le \frac{\delta}{2 \cdot \text{MAX-SIZE}}$ we will first observe that if there is an item $j$ which is $\epsilon$-far
from all of $C_r$, then the ball $B(j, \epsilon/5)$ must have significant mass, all of which is also not close to any
item in $C_r$. We will then conclude, by a standard coupon collector argument, that this mass is found
with high probability.

By the doubling dimension condition, the ball $B(j, \epsilon/5)$ must have mass at least $(\epsilon/5)^d$. Let $M_r$ be
the event that no item in $B(j, \epsilon/5)$ was sampled during the $r$-th iteration of the while loop. Then

$$\mathbb{P}(T_r \mid E^c) \le \mathbb{P}(T_r \mid M_r^c, E^c) + \mathbb{P}(M_r \mid E^c).$$

Note that given $M_r^c$, which implies that an item $j$ which is at least $0.8\epsilon$ away from all of $C_r$ was
sampled, the event $T_r$ happens only if $j$ is judged to be similar to some $c \in C_r$. However, since
we are conditioning on $E^c$ that cannot happen, and we get $\mathbb{P}(T_r \mid M_r^c, E^c) = 0$.

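Finally, we will use the coupon collector argument and show that $\mathbb{P}(M_r \mid E^c) \le \delta/(2 \cdot \text{MAX-SIZE})$.
The event $M_r$, that no item in $B(j, \epsilon/5)$ was sampled during the $r$-th iteration of the loop, happens
with probability at most

$$\left(1 - (\epsilon/5)^d\right)^{\text{MAX-WAIT}} \le \exp\left(-(\epsilon/5)^d \cdot \text{MAX-WAIT}\right) \le \frac{\delta}{2 \cdot \text{MAX-SIZE}},$$

as we wished.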
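The proof above refers to a sampling loop governed by MAX-SIZE and MAX-WAIT. A minimal sketch of such a greedy net construction is given below. It is an illustration under stated assumptions rather than the paper's GET-NET: exact distances replace the noisy routine SIMILAR, item types are tuples of $\pm 1$ ratings, and `distance`, `get_net_sketch` and `sample_item` are hypothetical names.

```python
import itertools
import random


def distance(x, y):
    """Fraction of users on which two item types (tuples of +/-1) disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def get_net_sketch(sample_item, eps, max_size, max_wait):
    """Greedy net: add a sampled item if it is eps-far from every current center.

    Assumption: exact distances stand in for the noisy SIMILAR routine.
    """
    centers = []
    while len(centers) < max_size:
        for _ in range(max_wait):
            j = sample_item()
            if all(distance(c, j) > eps for c in centers):
                centers.append(j)
                break
        else:
            # No far item appeared within max_wait draws: stop early.
            return centers
    return centers


rng = random.Random(0)
# Hypothetical item space: 8 item types over 40 users, equally likely.
item_types = [tuple(rng.choice((-1, 1)) for _ in range(40)) for _ in range(8)]
net = get_net_sketch(lambda: rng.choice(item_types), eps=0.2, max_size=50, max_wait=200)

# Packing property: all centers are pairwise more than eps apart.
assert all(distance(a, b) > 0.2 for a, b in itertools.combinations(net, 2))
```

With exact distances the packing holds at radius $\epsilon$ by construction; in the proof it holds only at $\epsilon/2$ and only on the event $E^c$, because SIMILAR can err. Coverage is not guaranteed deterministically here either, which is the role of the MAX-WAIT argument in (C2).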


It only remains to prove that the main tool used during exploration, MAKE-PARTITION, indeed
produces a partition of similar items with high probability. The additional properties stated in the
Lemma, regarding the size of the blocks, will be crucial later in ensuring quick cold-start performance.

Lemma A.4. Let e,S G (0,1), and let M > 12 ■ In (f Then with probability at 

least $1 - \delta$ the subroutine MAKE-PARTITION$(M, \epsilon, \delta)$ returns a partition $\{P_k\}$ of a subset of $M$
randomly drawn items such that

(i) For each block $P_k$ and $i, j \in P_k$ we have $\gamma_{ij} \le 1.2\epsilon$,

(ii) Each block $P_k$ contains at least $1/(2\epsilon)$ items,

(iii) Each block $P_k$ contains at most $1/\epsilon$ items,

(iv) There are at most $2M\epsilon$ blocks.

Proof. We will show that properties (i) and (ii) hold with probability at least $1 - \delta$, and note that
(iii) follows directly from the algorithm and that (iv) follows from (ii).
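To see why (iv) follows from (ii), note that $M$ items split into blocks of at least $1/(2\epsilon)$ items each can form at most $2M\epsilon$ blocks. The sketch below illustrates this with one possible way of breaking an oversized block into pieces whose sizes lie between $1/(2\epsilon)$ and $1/\epsilon$. The splitting rule and the name `split_evenly` are assumptions made for illustration (the paper's exact rule is not reproduced here), and $1/\epsilon$ is assumed to be an even integer.

```python
def split_evenly(items, max_size):
    """Split items into ceil(n / max_size) nearly equal consecutive blocks.

    Assumes len(items) >= max_size // 2 and max_size even, so every block
    ends up with between max_size // 2 and max_size items.
    """
    n = len(items)
    k = -(-n // max_size)          # ceil(n / max_size)
    base, extra = divmod(n, k)     # the first `extra` blocks get one more item
    blocks, start = [], 0
    for b in range(k):
        size = base + (1 if b < extra else 0)
        blocks.append(items[start:start + size])
        start += size
    return blocks


eps = 0.1                          # so 1/eps = 10 and 1/(2 eps) = 5
blocks = split_evenly(list(range(37)), max_size=10)
sizes = [len(b) for b in blocks]   # [10, 9, 9, 9]
assert all(5 <= s <= 10 for s in sizes)
assert len(blocks) <= 2 * 37 * eps  # property (iv): at most 2*M*eps blocks
```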

Let $\mathcal{C}$ be the event that the set $C$ returned by GET-NET is not an $\epsilon/2$-net for $\mu$, and let $\mathcal{M}$ be the set
of $M$ items sampled. Similarly to the proof of Lemma A.3, let $E_{ij}$ be the event
$\{S_{ij}, \gamma_{ij} > 0.6\epsilon\} \cup \{S_{ij}^c, \gamma_{ij} < 0.5\epsilon\}$, where $S_{ij}$ is the event that the routine
SIMILAR$(i, j, 0.6\epsilon, \delta/(4M|C|))$ returns TRUE. Intuitively, the event $E_{ij}$ happens when SIMILAR
returns what it shouldn't have. Furthermore, let $E = \bigcup_{j \in \mathcal{M}} \bigcup_{c \in C} E_{c,j}$.

Let $A^c$ be the event that for some block $P_k$ there exist $i, j \in P_k$ such that $\gamma_{ij} > 1.2\epsilon$, and let $B$ be
the event that all blocks in the partition have size at least $1/(2\epsilon)$. Therefore, $B^c = \bigcup_k B_k^c$, where $B_k^c$ is
the event that $P_k$ has size less than $1/(2\epsilon)$.

Since event $A$ guarantees condition (i) and event $B$ guarantees condition (ii), it suffices to show that

$$\mathbb{P}(A^c \cup B^c) \le \delta.$$

We will do so by conditioning on $\mathcal{C}^c$ and $E^c$, where after a couple of union bounds we arrive at

$$\mathbb{P}(A^c \cup B^c) \le \mathbb{P}(B^c \mid \mathcal{C}^c, E^c) + \mathbb{P}(A^c \mid \mathcal{C}^c, E^c) + \mathbb{P}(\mathcal{C}) + \mathbb{P}(E),$$

and showing

(A) $\mathbb{P}(B^c \mid \mathcal{C}^c, E^c) \le \delta/2$,

(B) $\mathbb{P}(A^c \mid \mathcal{C}^c, E^c) = 0$,

(C) $\mathbb{P}(E) \le \delta/4$, and

(D) $\mathbb{P}(\mathcal{C}) \le \delta/4$

completes the proof.

Proof of (A): Note that $B^c = \bigcup_{c \in C} B_c^c$, where $B_c$ is the event that the block $P_{k_c}$ constructed with

$c \in C$ as reference has size at least $1/(2\epsilon)$.

Note that the event Be happens whenever at least ( (which is at least 1/e) items are added 
to $P_{k_c}$. This is because in this event $P_{k_c}$ will receive more than $1/\epsilon$ items and hence
the algorithm will break it up into smaller blocks, each of which will be of size at least $1/(2\epsilon)$. Hence


where $X_{c,n}$ is the event that the $n$-th item sampled ends up in $P_{k_c}$. We will show how

$$\mathbb{P}(X_{c,n} \mid \mathcal{C}^c, E^c) \ge (\epsilon/12)^d \tag{21}$$

allows us to prove (A), and we then prove eq. (21). Note that the $\{X_{c,n}\}_n$ are independent, and
hence the sum $\sum_n X_{c,n}$ stochastically dominates the sum $\sum_n Y_{c,n}$, where each $Y_{c,n}$ is an

independent Bernoulli random variable of parameter $(\epsilon/12)^d$. By the Chernoff bound we then get


C M A , 

where the last inequality is due to M > 12 In ■ Hence we arrive at 

P(F^|C^F^)<^P(F/|C^F'=)< ^(|)" = 5/2 


5 /e\d 




as we wished. 

Proof of eq. (21): We can lower bound P {Xe,n \ F'^) as 

P {Xe,n I Cf F") > P {Xe,n \ Cf Ef < e/2) P,„ < e/2), (*) 

where G is the item drawn during MAKE-PARTITION. 
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Now note that the event occurs when (a) SIMILAR(c, jn, 0.6e, (5/(4M|C|)) returns TRUE, (b) c 
is chosen uniformly at random among the other items in C that are also similar jn- 

Conditioning on E'^, C^, and 7c,< e/2 guarantees that SIMILAR(c, 0.6e, (5/(4M|C|)) returns 
TRUE, and hence P (Xc,„ | < e/2) = l/itT, where K is the number of items in C 

for which SIMILAR also returned TRUE. By Proposition D.2 (with r ^ 0.6e and e e/2 in the 
Proposition), we get that if < (i? (c, e/2)) (| + {c,e/2)). Now noting 

that in eq. (*) Pj^ (7c, < e/2) = ii{B (c, e/2)), we arrive at 


P(Xc,„ I C^,E-) > P(Xc,„ I C^^;^7c,,„ < e/2)P,„ ( 7 c,, ,, < e/2) 

1 


- tl2W 


(^)‘^M(S(c,e/2)) 


Ai(S(c,e/2)) =(^) , 


( 22 ) 


which proves eq. (21) and hence (7l). 

Proof of (B): The event in question happens when for some block $P_k$ there are items $i, j \in P_k$ such that
$\gamma_{ij} > 1.2\epsilon$. Conditioning on $\mathcal{C}^c$, however, this can only happen if $\gamma_{c_k, j} > 0.6\epsilon$ (or $\gamma_{c_k, i} > 0.6\epsilon$)
but SIMILAR$(c_k, j, 0.6\epsilon, \delta')$ returned TRUE nevertheless. Conditioning on $E^c$, however, that
cannot happen, and the conditional probability in (B) is $0$.

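Proof of (C): By Lemma A.1, $\mathbb{P}(E_{c,j}) \le \frac{\delta}{4 (4/\epsilon)^d M}$, and hence

$$\mathbb{P}(E) \le \sum_{c \in C} \sum_{j \in \mathcal{M}} \mathbb{P}(E_{c,j}) \le |C|\, M\, \frac{\delta}{4 (4/\epsilon)^d M} \le \delta/4,$$

where the last inequality is due to $|C| \le (4/\epsilon)^d$, as guaranteed in Lemma A.3.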

Proof of {D): This follows directly from Lemma A.3. ■ 


A.3 Sufficient Exploration 

During epoch $\tau$ the algorithm uses any given recommendation as an explore recommendation
with probability $\epsilon_\tau$. In the Lemma below, we show that during each epoch there are enough
explore recommendations for the procedure MAKE-PARTITION to terminate.

Lemma A.5. With probability at least $1 - \epsilon_{\tau+1}$, during the $\tau$-th epoch the algorithm has enough
explore recommendations for MAKE-PARTITION$(M_{\tau+1}, \epsilon_{\tau+1}, \epsilon_{\tau+1})$ to terminate.

Proof. It suffices to prove the following two facts:

(A) the number of times explore is required for MAKE-PARTITION$(M_{\tau+1}, \epsilon_{\tau+1}, \epsilon_{\tau+1})$ to terminate
is at most $\frac{1}{2}\epsilon_\tau D_\tau N$ for each $\tau$, and

(B) with probability at least $1 - \epsilon_{\tau+1}$ we have that explore will be called at least $\frac{1}{2}\epsilon_\tau D_\tau N$ times.

Proof of (A): Let us denote by $MP(\tau + 1)$ the number of explore calls required for the routine
MAKE-PARTITION$(M_{\tau+1}, \epsilon_{\tau+1}, \epsilon_{\tau+1})$ to terminate. Then we want to show that $MP(\tau + 1) \le \frac{1}{2}\epsilon_\tau D_\tau N$,
or equivalently that

$$\frac{2}{\epsilon_\tau}\, MP(\tau + 1) / D_\tau \le N. \tag{★}$$

£ 

Note that (which we soon check) make-partition(M 7 -, Cr^Sr) makes at most 

MP{t -I- 1) < 4 • 630 {d + 1)^ Mr+i In^ (M^+i) (23) 









recommendations, and since Mr = ^t+i < Mr • 2^^+^ we get 

±^MP(r + l)< (^)"‘4^630(<i+lf2^«l„^g)l„(^l„(lfe)) (24) 


< . 630(d + 1)3 In" In 


'16\, /23"*+13 


V I ■ 


(25) 


is simple to show that In^ In d+2 ^ < ^(2(i+ll)(d + 2)^, which we can use to further 
)und ^ ^MP{t + 1) as 

l^MP(r + l)< (2“4“.63D(<i+l)»(l)"’) (24+11)(<J+2)1' 


oSd+lS 1 

-630(2d+ll)((i + 2)4 


d+5 ■ 


and since Sr > Sn = ^ • 630(2(i + ll)(d + 2)"^^^ we get that eq. (★) is satisfied and 

we are done. 

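Proof of (B): explore is called with probability $\epsilon_\tau$ at each of the $D_\tau N$ recommendations of epoch
$\tau$. Hence, all we need to show is that $\mathbb{P}\left(\mathrm{Bin}(D_\tau N, \epsilon_\tau) < \frac{1}{2}\epsilon_\tau D_\tau N\right)$ is at most $\epsilon_{\tau+1}$. This follows
from the Chernoff bound:

$$\mathbb{P}\left(\mathrm{Bin}(D_\tau N, \epsilon_\tau) < \tfrac{1}{2}\epsilon_\tau D_\tau N\right) = \mathbb{P}\left(\mathrm{Bin}(D_\tau N, \epsilon_\tau) < (1 - 0.5)\, \mathbb{E}[\mathrm{Bin}(D_\tau N, \epsilon_\tau)]\right)$$

$$\le \exp\left(-\frac{0.5^2}{2}\, \mathbb{E}[\mathrm{Bin}(D_\tau N, \epsilon_\tau)]\right) = \exp\left(-\frac{\epsilon_\tau D_\tau N}{8}\right) \le \epsilon_{\tau+1},$$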

where the second to last inequality follows from Dr 
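The Chernoff step in the proof of (B) can be checked numerically. The sketch below computes the exact lower tail $\mathbb{P}(\mathrm{Bin}(n, p) < np/2)$ and compares it with the bound $\exp(-np/8)$. The parameter values are arbitrary illustrations, with $n$ playing the role of $D_\tau N$ and $p$ the role of $\epsilon_\tau$.

```python
import math


def binom_lower_tail(n, p, threshold):
    """Exact P(Bin(n, p) < threshold)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(n + 1)
        if i < threshold
    )


def chernoff_half_mean_bound(n, p):
    """Multiplicative Chernoff bound exp(-(0.5**2 / 2) * n * p) = exp(-n * p / 8)."""
    return math.exp(-n * p / 8)


# Hypothetical values: n stands in for D_tau * N, p for epsilon_tau.
for n, p in [(400, 0.1), (1000, 0.05), (200, 0.3)]:
    exact = binom_lower_tail(n, p, n * p / 2)
    assert exact <= chernoff_half_mean_bound(n, p)
```

The exact tail is typically much smaller than the bound; the bound is what makes the requirement on $D_\tau$ explicit.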

B Quick Recommendations Lemma 


In Section 3 we described the algorithm, which starts recommending to a user as soon as it
knows of one item that the user likes. Below we show that, indeed, shortly after the beginning of the
epoch the slope of the regret is small.

Lemma B.l (Quick Recommendations Lemma). For t > 1 let TZ^'^\T) = ^ ~ 

Ljj-ijt) denote the number of bad recommendations made to users during the first TN recommen¬ 
dations of epoch T. Then we have 


E 


TZ^^^T) 


< 


(26) 


whenever T G [Train, t, Dr] and where Tmin,T — For T < Train,T, we trivially have 

E (T)] < T. 


Proof Let TZ'^D (T) denote the number of bad recommendations made to users during the first TN 
exploit recommendations of epoch r. Then, since the expected number of explore recommendations 
by time TN of epoch r is SrTN, we get that 


E 




< ErT + E 


nM){T) 


(27) 


Furthermore, as described in the algorithm, during the epoch t — 1 the algorithm spends a small 
fraction of the recommendations, in the explore part, to create a partition {Pk} (which we call 
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pseudocode) of Mr random items to be exploited during epoch r. Let Er be the event 
that the partition to be used during epoch r satisfies the conditions specified in Lemma A.4 

(with M = Mr e = Er, 5 = Er)- Then we get 


E 




< ErT + E 


n'^p{T) 


< 2 ErT + E 


I Er 


(28) 


where the last inequality is due to Lemma A.4, which guarantees that $\mathbb{P}(\mathcal{E}_\tau^c) \le \epsilon_\tau$.

For the remaining of the proof, we will show that E [TZ'{T)^'^'^ (T)] < ^ErT. We will do so by first 
rewriting in terms of the number of bad exploit recommendations to each user 


E 


I Er 


< —E 
- N 


I Er 


where 71u^\t) is the number of bad recommendations made to user u during the first TN exploit 
recommendations of epoch r. We will now bound the latter term by conditioning on a nice property 
of users (which we will characterize by the event gu,T), and showing that this property holds for 
most users. Let g^.x be the event that user u has tried at most blocks during the first TN 

recommendations of epoch r (we omit r in the notation of g^^r since it is clear from the context 
here). Here we use notation pP'> = to denote the total number of blocks in the partition 

for epoch r. Then we get 


—E 
N 




< IVe 

- N ^ 


I p„.t 1 {glx I £r) 


We dedicate Lemma B.3 to showing that P P [g^^ j, \ Er) < Hence, it suffices to show 

that for each T > T^in,T we have 


1 

N 


J2^\n^^P{T)\Er,gu,T 


104 

< - ErT, 

V 


(★) 


which we prove now. We will first rewrite the regret by summing over the number of bad recom¬ 
mendations due to each of the blocks as 




E 


n'P\T)\Er,gu,T] 


^ I £T,gu,T 


where Wu,k,T is the random variable denoting the number of bad exploit recommendations to user 
u from block Pj. among the first TN exploit recommendations of epoch r. We can further rewrite 
this as 


1 

N 




Wu,k,T 

_ k 


•) 9u,T 


P ^ ® [^u,k,T I Er,gu,T, Su,k,T] P {Su,k,T \ Er,gu,T) , 

k 

(29) 


where Su,k,T denotes the event that by time T user u has sampled an item from block Pk- Note that 
the reason why the natural term E Wu^k,T I ^T^gu,T, kT k t I ^T:gu,T^ is absent from 


the expression above is because E 
item from the block. 


Wu^k,T I gu,T, S^kT 


= 0 since the user hasn’t sampled an 


Now note that by conditioning on , we know that user u has sampled at most IQ^pP^ blocks. 
Now given g^^r as well as Er, the indices of the sampled blocks are not revealed. Let Khea random 
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variable that selects one of the indices of the blocks uniformly at random. Then, it follows that with 
respect to randomness in K, 




= 16' 


T 


^iSu,K,T I ^T.gu.r) < - 

We can re-write (29) in this notation and apply the above discussed bound to obtain 


(30) 


— Ve 

N ^ 


Wu,k,T I £t, 


9u,t 


< 


p(r 


N 

pi-r) 


E \Wu,K,T I 5«,T, Su,if,T] P (Su,if,T | 


< -y]] E \Wu^K,T I ffu.T, Su,K,t] 


N 

16T 

ND-r 


16T 

1^ 


EE ^\Wu,k,T \ £T^9u,T^Su,k,T] ■ (31) 


u k 


The right hand side above can be bounded as 

16T 

< 


y](l + 1.2e.|pM|)iV) 


ND, 

Lemma B.2 ^ 

Mr 104 

< 52SrT—— = - SrT. 

Dr jy 

Lemma AA 


(32) 


To see the above two inequalities, consider the following. The first inequality follows from
Lemma B.2 by realizing that each block $P_k^{(\tau)}$ corresponds to a collection of items such that for
any $i, j \in P_k^{(\tau)}$ we have $\gamma_{ij} \le 1.2\epsilon_\tau$. The second inequality can be argued as follows: from Lemma A.4,
$1/(2\epsilon_\tau) \le |P_k^{(\tau)}| \le 1/\epsilon_\tau$, and hence

$$\sum_k \left(1 + 1.2\epsilon_\tau |P_k^{(\tau)}|\right) \le \sum_k 3.2\epsilon_\tau |P_k^{(\tau)}| \le 3.2\epsilon_\tau \Big(\sum_k |P_k^{(\tau)}|\Big) = 3.2\epsilon_\tau M_\tau.$$

This, along with a simple calculation, completes the proof of eq. (★).


The lemma below was used in Lemma B.l. Informally, it says that our recommendation policy, 
which recommends the whole block to a user after the user likes an item in the block, succeeds in 
finding most likable items to recommend and in not recommending many bad items. 

Lemma B.2 (Partition Lemma). Let $P_k$ be a set of items such that for each $i, j \in P_k$ we have $\gamma_{ij} \le \epsilon$,
and consider the usual recommendation policy that ITEM-ITEM-CF uses during its "exploit" steps
(where when user $u$ samples a random item $i \in P_k$, only if $u$ likes $i$ will $u$ be recommended the
remaining items). Let $S_{u,k}$ be the event that user $u$ has sampled an item from $P_k$, let $W_{u,k}$ (W
for wrong) denote the number of wrong recommendations made to $u$ from $P_k$, and let $A_{u,k}$ (A for
absent) denote the number of items in $P_k$ that $u$ likes that are not recommended to $u$. Then we get

$$\sum_u \mathbb{E}[A_{u,k} + W_{u,k} \mid S_{u,k}] \le (1 + \epsilon |P_k|)\, N.$$

Proof. For each block $P_k$ and user $u$, let $\ell_{u,k} = |\{i \in P_k : L_{u,i} = +1\}|$ denote the number
of items in $P_k$ that $u$ likes. Note that $\mathbb{E}[A_{u,k} \mid S_{u,k}] = \ell_{u,k} \cdot \frac{|P_k| - \ell_{u,k}}{|P_k|}$ and
$\mathbb{E}[W_{u,k} \mid S_{u,k}] = (|P_k| - \ell_{u,k}) \cdot \frac{\ell_{u,k}}{|P_k|} + 1 \cdot \frac{|P_k| - \ell_{u,k}}{|P_k|}$, because with probability $\ell_{u,k}/|P_k|$ user $u$ will
sample an item from $P_k$ that $u$ likes and will then be recommended $(|P_k| - \ell_{u,k})$ bad items, and with
probability $(|P_k| - \ell_{u,k})/|P_k|$ the first item recommended to $u$ is bad. Likewise, with probability
$(|P_k| - \ell_{u,k})/|P_k|$ the user will sample an item that the user dislikes, and then fail to be
recommended $\ell_{u,k}$ items that the user likes. Hence we have that


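$$\sum_u \mathbb{E}[A_{u,k} + W_{u,k} \mid S_{u,k}] = \underbrace{\sum_u \frac{|P_k| - \ell_{u,k}}{|P_k|}}_{\le N} + 2 \sum_u \frac{\ell_{u,k}\,(|P_k| - \ell_{u,k})}{|P_k|}. \tag{33}$$

Now, $2\,\ell_{u,k}(|P_k| - \ell_{u,k})$ is the number of ordered pairs $(i, j)$ of items in $P_k$ on which user $u$
disagrees. Recalling that $\gamma_{ij}$ is the fraction of users whose ratings of $i$ and $j$ differ, we get

$$2 \sum_u \frac{\ell_{u,k}\,(|P_k| - \ell_{u,k})}{|P_k|} = \frac{1}{|P_k|} \sum_{i, j \in P_k} \sum_u \mathbb{1}\{L_{u,i} \ne L_{u,j}\} = \frac{1}{|P_k|} \sum_{i, j \in P_k} N \gamma_{ij} \le \epsilon\, |P_k|\, N.$$

Putting it all together we get

$$\sum_u \mathbb{E}[A_{u,k} + W_{u,k} \mid S_{u,k}] = \sum_u \frac{|P_k| - \ell_{u,k}}{|P_k|} + 2 \sum_u \frac{\ell_{u,k}\,(|P_k| - \ell_{u,k})}{|P_k|} \tag{34}$$

$$\le N + \epsilon\, |P_k|\, N = (1 + \epsilon |P_k|)\, N. \tag{35}$$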
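The bound of Lemma B.2 can be checked exactly on a small synthetic block. The sketch below computes $\sum_u \mathbb{E}[A_{u,k} + W_{u,k} \mid S_{u,k}]$ from a ratings matrix and compares it with $(1 + \epsilon|P_k|)N$, where $\epsilon$ is taken to be the largest pairwise distance in the block. It assumes that $\gamma_{ij}$ is the fraction of users who rate $i$ and $j$ differently; the ratings matrix and function names are hypothetical.

```python
import itertools
import random


def expected_errors(likes):
    """Sum over users of E[A + W | S] for one block.

    likes: one row per user, each a list of +1/-1 ratings of the block's items.
    """
    total = 0.0
    for row in likes:
        size = len(row)
        liked = sum(1 for v in row if v == +1)
        disliked = size - liked
        p_like = liked / size
        # Liked first item: all disliked items get recommended afterwards.
        # Disliked first item: that single recommendation is the only wrong one.
        wrong = p_like * disliked + (1 - p_like) * 1
        # Disliked first item: the liked items are never recommended.
        absent = (1 - p_like) * liked
        total += wrong + absent
    return total


def max_pairwise_distance(likes):
    """Largest fraction of users disagreeing on a pair of items in the block."""
    n_users, size = len(likes), len(likes[0])
    return max(
        (sum(row[i] != row[j] for row in likes) / n_users
         for i, j in itertools.combinations(range(size), 2)),
        default=0.0,
    )


rng = random.Random(1)
n_users, block_size = 30, 6
# Hypothetical block: users mostly agree across items, with 10% random flips.
base = [rng.choice((-1, 1)) for _ in range(n_users)]
likes = [[b if rng.random() > 0.1 else -b for _ in range(block_size)] for b in base]

eps = max_pairwise_distance(likes)
assert expected_errors(likes) <= (1 + eps * block_size) * n_users + 1e-9
```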


The lemma below was also needed in the proof of Lemma B. 1. 

Lemma B.3 (Auxiliary Claim). Consider an arbitrary epoch t, and let be the event that by the 
{TNy^ exploit recommendation of epoch t user u has tried at most 16^ | blocks from the 

partition {P^^^} constructed during the MAKE-PARTITI0n(M,-, £,-, £t) of the previous epoch, and 
let £-r be the event that {Pj.^^} satisfies the conditions specified in Lemma A.4. Then 

1 ._. 49 

u 

holds for any T G ln(l/£T-), Pt]- 
Proof. Let us consider a few definitions:

1. Let be the event that by the exploit recommendation user u has been recom¬ 

mended at most l.ir items, and that u likes at least O.dvMr among the items in {P^^. 

2. Let Hu,t be the event that by the TN*^ exploit recommendation there are still at least 

items liked by u in blocks that haven’t been sampled by u. 
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Then we get 


V{glr\£r)<V{glT 


Hu,T,Nu,T,£r)+V{K^T 


Nu,T,£r)+V{NlT\£r)- 


We will show that for T S ln(l/eT-), Dt] 

{A) F{g^ j, I Hu,T,Nu,T,£r) < £t 
(B) jjEu^{Hu,T\Nu,T,£r)<fer 
iC) ¥ \ £r) < £r. 


from which the lemma follows. 

Proof of (A): Note that by conditioning on Nu,t we know that there were at least 0.2i/Mr items 
likable to u for each t < T. Now let £u,k denote the event that u likes the first item sampled from 

(t) 

block . Then we have that 




\^9'a,T^£T, Au,T, C I ^ \Pk^ \ • ^ ^■^T,£r, Nu,T, Hu,T 


(36) 


where Pj,^ is the n*" block sampled by u. This in turn gives us 

. X / 16 ^|{Pr’}| X 

I £r.N^,T,H^,Tj < p( 51 ^ I j, (37) 

and we will now prove that the latter is at most Et- 

First, note that by conditioning on St 7 we are guaranteed that |Pfc^ | > l/2e^ by Lemma A.4. Hence 

<p E 1. 


< 2.2T£t I Sr, Hu,T7 Nu^T ) • 


n—1 


We will now show following two claims, which in turn implies (A) in light of the above discussion. 


(Ai) For each n, P{£u,k„ \ £t 7 stochastically dominates a Bernoulli random vari¬ 

able with parameter O.li/. 

(A2) Let {Xn} be a set of independent Bernoulli random variables with parameter O.lz^. Then 


p 


E 


Xn < 2.2 T£t < £ 


(38) 


\ 


n—1 


Proof of (A1): Since we are conditioning on $H_{u,T}$, there are still $0.2\nu M_\tau$ liked items in unsampled
blocks, and since we are conditioning on $\mathcal{E}_\tau$, there are at most $2M_\tau\epsilon_\tau$ blocks, each of size at
least $1/(2\epsilon_\tau)$ and at most $1/\epsilon_\tau$. It can be easily argued that the setting in which sampling a random
block is least likely to yield a likable item is the one in which all likable items are in the largest blocks. By $\mathcal{E}_\tau$,
we can "fit" $0.2\nu M_\tau$ items in at least $0.2\nu M_\tau/(1/\epsilon_\tau) = 0.2\epsilon_\tau\nu M_\tau$ blocks. Therefore,

$$\mathbb{P}(\mathcal{E}_{u,k_n} \mid \mathcal{E}_\tau, H_{u,T}, N_{u,T}) \ge \frac{0.2\epsilon_\tau \nu M_\tau}{2 M_\tau \epsilon_\tau} = 0.1\nu. \tag{39}$$
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Proof of {A 2 ): Let {Xn} be a set of independent Bernoulli random variables with parameter OAi/. 
Then the sum of 16;^|{P^^^}| such i.i.d. random variables have average equal to 

= Z.2Ter, (40) 

where we have used the fact that since by £r each block is of size at most 1 /et- and number of items 
Mr, we have at least MrSr blocks and Dr = MrV 12. Then 







\ 

^ < 2.2T£r 

<P 

E 

n—1 

-0.25)E 

1 

W 1 

_1 

) 


(41) 


where the inequality is due to (40). Now by the Chernoff bound we get 


exp - 


0.252 

2 + 0.25' 


-E 




n—1 


< exp — 


i.2Ter 

36 


(42) 


Therefore, if T > ^ ln(l/ET-), then the right hand side is less than Sr, as desired. 

Proof of (B): First, for each u it is clear that P(iT° j, \ N^^t) < | Nu,Dr) holds, since 

by end of the epoch the user will have explored the most (recall T < Dr). Recall that under event 
Nu, user u has been recommended at most l.lDr items and that at least Q.QvMr items in the 
partition are likable to u. For ^ to happen, that is, for there to be at most 0.2iyMr at the end of 
DrN exploit recommendation (overall), it must be that at least 


0.9:^M^ - 0.2vMr - l.lDr = O.lvMr - 0.55uMr = 0.15vMr. (43) 


many items liked by user u are “wasted”. Using notation from the proof of Lemma B.2, formally it 
can be written as 


>0.15j.M,. (44) 

k 


By Markov’s inequality we get that 

> 0.15j^M.r I £r,Nu,T^ < > Q.lbvMr I £r,Nu,T^ 

k k 

+) 'P( Efe ^k,u > O.lbl^Mr I fr) 

P(iV„,T) 

ib) P(Efc4lfe.. >0.15z^M, 

“ 1 — Et- 

^ Efc ^k,u I £t\ - 

Q.lbvMr ’ ^ ’ 

where inequality (a) uses the fact that $\mathbb{P}(A \mid B) \le \mathbb{P}(A)/\mathbb{P}(B)$; inequality (b) uses the fact (C) (we
note that (B) is not used to prove (C), and hence there is no circularity in the argument); and the last
inequality uses the fact that $\epsilon_\tau \le 1/2$ for all $\tau$. We note that, effectively, we have assumed that
$S_{u,k,D_\tau}$ has happened for all $k$ for the given user $u$. Therefore, we will utilize Lemma B.2 to bound


P(iV„,T) 

{Ek^k,u>0.15l^Mr\£r) 

1 — Er 
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the right hand side as follows; 


1 

N 


>Q.‘ibvM-r \ SttNu^T^ < 


< 


1 ^ 2E [ Ek^k.u\£r] 
O.lBl'Mr 

(^(1 + S-rl-Pfc 


0.15i^MrN 


(a) 

< 


40ei- 
vMr 

40eT 


{J2^er\p^ 

I 

Ei^- 


(46) 


where (a) uses the fact that under 8r, \Pk \ ^ l/2£r for all k, and we have used the fact that all 
partitions sum up to Mr. 

Proof of (C): The event $N_{u,T}^c$ happens whenever a user $u$ has been recommended more than
$1.1 D_\tau = 0.55\nu M_\tau$ items by time $T$, or when the user likes fewer than $0.9\nu M_\tau$ items. The probability
that user $u$ has been recommended more than $0.55\nu M_\tau$ items by time $T$ is greatest at $T = D_\tau$,
and, by the Chernoff bound, is


Bin ( NDr, — ) > 1.1Z?T- ) < exp ( — 


.12 


2-f 0.1 


-A 


(47) 


which, since Dr > 210 ln(-^), is at most £r/2. 


The probability that a user u likes less than Q.9vMr among the Mr items can be also bounded using 
a Chernoff bound; 


’(Bin(Mr, < O.dvMr) < exp ( —- 7^-MrV 


2-f 0.12 


< £r/ 2 , 


(48) 


where the last inequality is due to Mr > 210 In(^). 

C Proof of Main Results 


In the previous section we proved Lemma B.l, which states that shortly after the beginning of each 
epoch, the expected regret of the algorithm becomes small. This allows us to prove our main results 
below. 

Theorem 5.2 (Regret Upper Bound of ITEM-ITEM-CF). Suppose assumptions A1 and A2 are 
satisfied. Then ITEM-ITEM-CF achieves expected regret 


E [7^(^)] < 


Train + aip, d) ■ {T - Tmin) log2('T - Tmin) 
+ £n (T — Tniax) 


( 2 ) 


where Ain = O(^) + Aax = g{i^,d)N, £N,d,u = h{d,v) fi = Ain + 

a{v, d) ■ (Aax - Ain)^ log2(Aax Tmin)- 


Proof. Recall that at its beginning ITEM-ITEM-CF runs the routine
MAKE-PARTITION$(M_1, \epsilon_1, \epsilon_1)$. This consumes at most

MP(1) ^ 4 • 630 {d -f 1)^ Ml In^ In (Mi) (49) 
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recommendations (by Lemma A.4), and hence finishes in at most Tmp = MP(l)/iV time steps. For 
this initial exploratory period T < Tmp we will bound the regret with the trivial bound Tl{T) < T. 

Let us now deal with the regime between T^in and T^ax- Recall that the target Sj- used in the 
epoch is decreasing as until it plateaus at en when ^ < en, where C = Hence 

T* = riog2 —1 (50) 

En 

is the first epoch in which en is used. For a function g defined later, we will show that 

T*-l 

Tmp(i) + ^ Dr > g{p,d)N‘i+=^ = T^ax- (51) 

t' — 1 


Now since e^ = ^ ^ ■ 630(2d + ll)(d + , we get that 

((l48 • 2o) 


d+5 


T* > 


d+5 


630(2d+ ll)(d + 2)4 2^d+i8 


■N]. 


Also, 


TmP( 1) + '^ Dr > '^ Dr — ^Mr, 


T=1 


T — 1 


T = 1 
A 


(52) 


(53) 


where we used the fact that Dr = '^Mr- Recall Mr = C'M-i 2 ln(^), where Cm = 

- -— ’—{3d + 1), and for t < t* we have Er = C'/2'^, where C = p/( 148 • 20). Then we 

get 

T* —1 T*—1 


T 


MP(1) 


T=1 


^Cm (1/C)"+^ 5] 2^ 


(d+2) 


T — 1 




> -Cm (1/C) 


d+2 


630(2d+ ll)(d +2)4 25^+18 


d + 2 

1 \ + ^ d+2 . 

.JV3+5 A -p 


aC,d) 


(54) 


as wished. Hence, between $T_{min}$ and $T_{max}$ the target $\epsilon_\tau$ for the epochs is indeed halving for each
subsequent epoch. Let $\tau(T)$ be the epoch of time $T$. Then, by Lemma B.1, for $T \in [T_{min}, T_{max}]$,
where $T_{min} = T_{MP} + T_{min,1}$, the expected regret satisfies


'R{T) — Tmp < -^ Et-D. 


r(T) 


(55) 


which we can further bound as 


7^(T) -Tmp<^ log2 
< -Y logs 


nr(T) \ 

T—1 


2C 

2r{T) 

2C 


2(r(r)+i)(d+i)^ 


(56) 


Now, since for T > Tmin the epoch t{T) is at most 1 + log 2 iog( 2 /c) ) ’ S®*- 


n{T) < Tmp + ^ loga 


^ _ 1 T Tmp \ 2{d+l)oiriT) + l)(d+l) 

C{d + 2)\og{2/C) Cm J ' ’ 
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1 


< TmP + —^ log 2 


C{d + 2) log(2/C)CMy 


\ 24(d+l) 


log2 {T-Tmp)2^+^ 


log2 


Cm log(2/C), 




d+l 

/ 1 1 \ d+2 

^ ti(w) <"■ ■ 

'-V-" 

^a{y,d) 


as we wished, which completes the proof of the sublinear regret regime. 


The case T > Tmax now follows. Recall that by Lemma B.l we get 


'R{T) < Tmp + 
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V 


r(T) 


^tDt, 


(58) 


which we can in turn split between before T^ax and after Tmax as 




T = l 


(59) 


^ Tmp + d)T,^% log2 (Tmax) Tsn(T Tmax ), (60) 

'-' 

=p 


as claimed, and where the last inequality is due to the sublinear regime proved above. ■ 


We are now ready to bound the cold-start time of ITEM-ITEM-CF. Recall that the cold-start time of
a recommendation algorithm $\mathcal{A}$ is defined as the least $T + \Gamma$ such that for all $\Delta \ge \Gamma$ we have
$\mathbb{E}[\mathcal{R}^{(\mathcal{A})}(T + \Delta) - \mathcal{R}^{(\mathcal{A})}(T)] \le 0.1\,\Delta$.
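To make the cold-start condition concrete, the sketch below checks it on a finite, made-up expected-regret curve. It only tests the inequality for a given pair $(T, \Gamma)$ over a finite horizon and does not search for the least $T + \Gamma$; the curve and the function name are illustrative assumptions.

```python
def satisfies_cold_start(regret, T, gamma, slope=0.1):
    """Check regret[T + d] - regret[T] <= slope * d for all d >= gamma in the horizon.

    regret: list where regret[t] is the expected cumulative regret at time t.
    """
    horizon = len(regret) - 1
    return all(
        regret[T + d] - regret[T] <= slope * d
        for d in range(gamma, horizon - T + 1)
    )


# Hypothetical curve: every recommendation is bad for 5 steps, then slope 0.05.
regret = [min(t, 5) + 0.05 * max(t - 5, 0) for t in range(101)]

assert satisfies_cold_start(regret, T=5, gamma=0)       # after the initial phase
assert not satisfies_cold_start(regret, T=0, gamma=10)  # initial phase not yet amortized
assert satisfies_cold_start(regret, T=0, gamma=96)      # amortized over a long window
```

The example shows why the definition allows both a start time $T$ and a window length $\Gamma$: early losses can be either skipped (larger $T$) or averaged out (larger $\Gamma$).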

Theorem 5.1 (Cold-Start Performance). Suppose assumptions A1 and A2 are satisfied. Then the
algorithm ITEM-ITEM-CF has cold-start time $T_{cold\text{-}start} = f(\nu, d)/N + O(1/\nu)$.

Proof. First recall the usual dehnitions: = |Mt-, T^ = Tmp + and 

Tmp = f(v,d)/N, where f(p,d) is the number of recommendations required for the ini¬ 
tial MAKE-PARTITION call (as Stated in Lemma A.4), and Tmin.r = (as stated in 

Lemma B.l). We will show the bound in the dehnition of cold-start time with T = Tmp and 
r = Tmzn.l, which implies Tcold-start — Tmp T Tmin,l' 

To complete the proof, we shall establish the following two properties:

(i) For any $\Delta \ge 0$, $\mathbb{E}[\mathcal{R}(T_{MP} + T_{min,1} + \Delta) - \mathcal{R}(T_{MP})] \le 0.1\,(T_{min,1} + \Delta)$, for
$T_{MP} + T_{min,1} + \Delta \le T_2$. This condition says that the desired property holds for times involving
the first epoch, and

(ii) $\mathbb{E}[\mathcal{R}(T_\tau + \Delta) - \mathcal{R}(T_\tau)] \le 0.05\,(\Delta + D_{\tau-1})$, for $\Delta \le D_\tau$ and $\tau \ge 2$.

Before we show how the above two properties imply the desired result, we note that (i) follows
directly from Lemma B.1, and (ii) will be proved at the end.

Now let us complete the proof using (i) and (ii). To that end, consider a time of the form
$T_{cold\text{-}start} + \Delta = T_{MP} + T_{min,1} + \Delta$, for any $\Delta \ge 0$. Let $\tau^* \ge 1$ be the epoch to which
$T_{MP} + T_{min,1} + \Delta$ belongs, i.e.

$$T_{\tau^*} \le T_{MP} + T_{min,1} + \Delta < T_{\tau^*+1}.$$

Define $t = T_{MP} + T_{min,1} + \Delta - T_{\tau^*} \ge 0$. We shall treat the cases $\tau^* = 1$ and $\tau^* > 1$ separately. For
$\tau^* = 1$, (i) implies the desired result. For $\tau^* > 1$, we use (ii) to argue as follows:


E [7^ (Tmp + T^in,i + A) — 7^ (Tmp)] E [7^ (T 2 ) — TZ (Tmp)| . 

T:s+A—<0.1 -£>1 by (z) 

<0.05(t+i3x-l) by (zi) 

+ (e E [7^ (Tr) - n (T^-i)]_ ) +e1^'^tEmEE(T^ 

^ ^ <0.05(Dt-+D.^_i) by (zz) 

< 0.05 (f + 2 ^ T)^ ] 


< 0.1 • (A + T„iin,l)- 


(61) 


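This establishes the desired result that $T_{cold\text{-}start} = T_{MP} + T_{min,1} = f(\nu, d)/N + O(1/\nu)$.

Proof of (ii): Now we argue the remaining property (ii). Lemma B.1 tells us that for
$\Delta \in (T_{min,\tau}, D_\tau)$ we have that $\mathbb{E}[\mathcal{R}(T_\tau + \Delta) - \mathcal{R}(T_\tau)] \le \frac{148}{\nu}\epsilon_\tau \Delta$, and in particular

$$\mathbb{E}[\mathcal{R}(T_\tau + T_{min,\tau}) - \mathcal{R}(T_\tau)] \le \frac{148}{\nu}\,\epsilon_\tau\, T_{min,\tau}.$$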

Thus, for A < Dr, we have that 

148 148 

E [TZ {Tr + A) — TZ (Tr)] < —j^£rTmin,T + 

<0.05 for t>2 

In above we used the fact that for A < Tmin,i, TZ {Tr + A) < 7?. {Tr + Tmin.i)- Using the fact that 
Train,T — ^ In A 0.05^^ ln(A)^ = 0.05Dr-i, we conclude 

E [TZ {Tr + A) - 7^ (Tr)] < 0.05 (A + Dr-i). (63) 


Theorem 5.3 (Asymptotic linear regret is unavoidable). Consider an item space $\mu$ satisfying
assumptions A1 and A2. Then any online algorithm must have expected asymptotic regret
$\mathbb{E}[\mathcal{R}(T)] \ge C(\nu, N) \cdot T$, where $C(\nu, N) = (1 - 2\nu)/N$.


Proof Let {T, be the set of distinct items that have been recommended up to time TN. 

Then we have 


E mT)] = ^E 


'TN 


E 9^^ “ Nt,It) 


= —E 
N 


> —E 
- N 


kT TN 


EE 9 ^L=z ;,(1 - Tut,iJ 


U‘=i 
kT 1 


E9(^ NTk,lk) 


U-l 


where $T_k$ is the first time at which the item $i_k$ is recommended to any user. Now note that for each
$k$, by (A2), we have that $\mathbb{E}\left[\tfrac{1}{2}(1 - L_{u_{T_k}, i_k})\right] \ge 1 - 2\nu$, since when we have no prior information
about $i_k$ the best we can do is to recommend it to the user that likes the largest fraction of items.
Hence we get

$$\mathbb{E}[\mathcal{R}(T)] \ge \frac{1 - 2\nu}{N}\, \mathbb{E}[k_T]. \tag{64}$$

Since each item can be recommended to each user at most once, we see that by the $TN$-th
recommendation at least $T$ different items must have been recommended (that is, $k_T \ge T$). We can then
conclude that


E > 


1 - 2v 
N 


(65) 


C{v,d) 


as we wished. 


D More on Doubling Dimension 


In this section we provide examples of spaces with low doubling dimension, give some useful prop¬ 
erties, and describe experiments indicating that doubling dimension is often small in practice. 

Example D.1. Consider an item space $\mu$ over $N$ users that assigns probability at least $w > 0$ to $K$
distinct item types with separation at least $\alpha > 0$. Then, since $\mu(B(x, \alpha)) \le 1$ and $\mu(B(x, \alpha/2)) \ge w$,
we have that

$$d = \max_x \sup_r \log_2 \frac{\mu(B(x, r))}{\mu(B(x, r/2))} \le \max_x \log_2 \frac{\mu(B(x, \alpha))}{\mu(B(x, \alpha/2))} \le \log_2 \frac{1}{w}. \tag{66}$$

Similarly, if we only know that there are at most $K$ equally likely item types we can bound the
doubling dimension as

$$d = \max_x \sup_r \log_2 \frac{\mu(B(x, r))}{\mu(B(x, r/2))} \le \log_2 \frac{1}{1/K} = \log_2 K. \tag{67}$$


With the example above in mind, we would like to emphasize that doubling dimension assumptions
are strictly more general than the style of assumptions made in [4] (finite $K$ with separation
assumptions) because (a) doubling measures require no separation assumptions (that is, two item types $x$
and $y$ that are arbitrarily close to each other can have positive mass) and (b) the number of types of
positive mass is no longer bounded by a finite $K$, but instead can grow with the number of users.

Example D.2. Consider an item space $\mu$ that assigns probability $1/K$ to each of $K$ item types 
drawn uniformly at random from $\{-1, +1\}^N$. Then, for the pairwise distances $\gamma_{i,j}$ between item types $i, j$, we have that 

$$P\big(\exists\, i \ne j : \gamma_{i,j} \notin [.4, .6]\big) \le \binom{K}{2}\, P\big(\gamma_{i,j} \notin [.4, .6]\big) \le \binom{K}{2} \exp(-\Theta(N)), \tag{68}$$

where the first inequality is due to a union bound, and the second to a Chernoff bound. Hence, with 
high probability we get that $d \ge \log_2 K$. By Example D.1 we also have 
that $d \le \log_2 K$, and hence we can conclude that with high probability we have that $d = \log_2 K$. 
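
The claim in this example can be checked numerically. The sketch below is illustrative only: the values of `K` and `N`, the radius grid, and the helper names `disagreement` and `doubling_dimension` are assumptions made here, not part of the paper. It draws K random item types, measures pairwise distances as the fraction of users on which two types differ, and evaluates the maximum over types and radii of the log-ratio of ball masses under the uniform measure.

```python
import math
import random


def disagreement(x, y):
    """Fraction of coordinates (users) on which two +1/-1 item types differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def doubling_dimension(types, radii):
    """Max over types x and radii r of log2(mu(B(x, r)) / mu(B(x, r/2))).

    mu is the uniform measure on `types`, so the mass ratio equals a ratio of
    counts. Balls are closed. The sup over r is approximated on the grid `radii`.
    """
    k = len(types)
    dist = [[disagreement(types[i], types[j]) for j in range(k)] for i in range(k)]
    worst = 1.0
    for i in range(k):
        for r in radii:
            in_ball = sum(dist[i][j] <= r for j in range(k))
            in_half = sum(dist[i][j] <= r / 2 for j in range(k))
            # in_half >= 1 because x always lies in its own ball
            worst = max(worst, in_ball / in_half)
    return math.log2(worst)


random.seed(0)   # illustrative values, not taken from the paper
K, N = 8, 2000
types = [[random.choice((-1, 1)) for _ in range(N)] for _ in range(K)]
radii = [t / 100 for t in range(1, 101)]
pairwise = [disagreement(types[i], types[j]) for i in range(K) for j in range(i)]
d = doubling_dimension(types, radii)
print(min(pairwise), max(pairwise), d, math.log2(K))
```

When every pairwise distance falls in [.4, .6], a ball of radius 0.7 around any type contains all K types while the ball of half that radius contains only the type itself, which is where the ratio K comes from.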

Proposition D.1. Let $\mu$ be an item space for $N$ users with doubling dimension $d$. Then for any item 
type $x \in \{-1,1\}^N$ with $\mu(x) > 0$ we have 

$$\mu(B(x, r)) \ge r^{d}. \tag{69}$$

Proposition D.2. Let $\mu$ be an item space for $N$ users with doubling dimension $d$, let $C$ be an $\varepsilon$-net 
for $\mu$, let $j$ be an arbitrary item, let $c_j \in C$ be such that $\gamma_{j,c_j} \le \varepsilon$, and let $m_{c_j} \triangleq \ldots$ Then, 
for each $r \in [\varepsilon/2, 1/2]$, there are at most $m_{c_j}$ items in $C$ within radius $r$ of $j$. 


Proof. By the doubling dimension of $\mu$ we get 


piB{cj,r-G-e] <p{B{cj,e)) 


r 5£/4 


= rric 


■5e/4\ 


■ 


w 
We will now use this bound on $\mu(B(c_j, r + 5\varepsilon/4))$ to show that we can pack at most 
$m_{c_j}$ items from $C$ within $r$ of $j$. 

Since $C$ is an $\varepsilon$-net, any two items $i, j \in C$ are at least $\varepsilon/2$ apart, and hence the balls of radius $\varepsilon/4$ 
around each $i \in C$ are disjoint. Say that there are $K = |C_j|$ items of $C$ within distance $r$ of $j$, where 
$C_j = \{c \in C \mid \gamma_{c,j} \le r\}$. Then we get 

$$K \left(\frac{\varepsilon}{4}\right)^{d} \le \sum_{c \in C_j} \mu(B(c, \varepsilon/4)) = \mu\Big(\bigcup_{c \in C_j} B(c, \varepsilon/4)\Big) \le \mu(B(c_j, r + 5\varepsilon/4)), \tag{70}$$

where the first inequality is due to Proposition D.1 and the last inequality is due to $\bigcup_{c \in C_j} B(c, \varepsilon/4) \subset 
B(c_j, r + 5\varepsilon/4)$. Using the bound from eq. (**) we arrive at 


K < 



5e/4y 


(71) 


as we wished. ■ 

Finally, we would like to note that doubling dimension is not only a “proof technique”: it can be 
estimated from data and tends to be small in practice. To illustrate this point, we calculate the 
doubling dimension of the Jester Jokes Dataset¹ and of the MovieLens 1M Dataset². 

The Jester dataset contains ratings of one hundred jokes by over seventy thousand users. The dataset 
is fairly dense (the average number of ratings per user is over fifty), which makes it well suited 
to calculating the doubling dimension. For the MovieLens 1M Dataset we consider only the movies 
that have been rated by at least 750 users (to ensure some density). 

The Jester ratings are in $[-10, 10]$, with an average of 2, so we map ratings greater than 2 to $R_{u,i} = 
+1$, and ratings at most 2 to $R_{u,i} = -1$. For the MovieLens 1M Dataset we map ratings 1, 2, 3 to 
$-1$, and 4, 5 to $+1$. We then estimate the doubling dimension as follows: 

• For each pair of items $(i,j)$, we calculate $d_{ij,\Delta}$ as the fraction of users that disagree on them, 
where the $\Delta$ subscript denotes our assumption that each entry has a noise probability 
of $\Delta$ (that is, $P(R_{u,i} \ne L_{u,i}) = \Delta$), where $R$ is the empirical ratings matrix and $L$ is the 
true, noiseless, ratings matrix. 

• Assuming that each entry has a noise probability of $\Delta = 0.20$, we estimate the true distance 
$d_{ij}$ as the solution to $d_{ij,\Delta} = (1 - d_{ij})(2\Delta(1 - \Delta)) + d_{ij}(\Delta^2 + (1 - \Delta)^2)$. 

• For each item $i$ and $r$ in $\{0, \dots, 1\}$, let $N_{i,r}$ be the number of items $j$ such that 
$d_{i,j} \le r$. 

• For each item $i$, let $d_i$ be the least value such that $N_{i,2r}/N_{i,r} \le 2^{d_i}$ for each $r$ in $\{0, \dots, \tfrac{1}{2}\}$. 

• Figures 1 and 2 show the histograms of the $\{d_i\}$. 
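
The estimation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the experiments. The function names, the `None` encoding of missing ratings, the uniform radius grid, and the treatment of never co-rated pairs are all choices made here. The observed quantity is taken to be the fraction of co-rating users who disagree, since the mixing equation treats it as a distance. Solving that equation for the true distance gives (observed − 2·noise·(1 − noise)) / (1 − 2·noise)², which the sketch clips to [0, 1].

```python
import math


def observed_disagreement(ratings, i, j):
    """Fraction of users who rated both items i and j and gave them different signs.

    ratings[u][i] is +1, -1, or None (unrated).
    """
    both = [(row[i], row[j]) for row in ratings
            if row[i] is not None and row[j] is not None]
    if not both:
        return 1.0  # assumption: never co-rated items are treated as far apart
    return sum(a != b for a, b in both) / len(both)


def denoise(d_obs, noise):
    """Solve d_obs = (1 - d) * 2*noise*(1 - noise) + d * (noise**2 + (1 - noise)**2) for d.

    Requires noise < 0.5. The result is clipped to [0, 1].
    """
    flip_one = 2 * noise * (1 - noise)            # exactly one of two entries flipped
    d = (d_obs - flip_one) / (1 - 2 * flip_one)   # 1 - 2*flip_one == (1 - 2*noise)**2
    return min(1.0, max(0.0, d))


def item_doubling_dimensions(ratings, noise=0.20, steps=20):
    """Per-item estimate d_i: least d with N_{i,2r} / N_{i,r} <= 2**d on a grid of r."""
    n_items = len(ratings[0])
    dist = [[0.0 if i == j else denoise(observed_disagreement(ratings, i, j), noise)
             for j in range(n_items)] for i in range(n_items)]
    radii = [t / (2 * steps) for t in range(1, steps + 1)]  # assumed grid on (0, 1/2]
    dims = []
    for i in range(n_items):
        worst = 1.0
        for r in radii:
            n_r = sum(d <= r for d in dist[i])       # >= 1, item i itself
            n_2r = sum(d <= 2 * r for d in dist[i])
            worst = max(worst, n_2r / n_r)
        dims.append(math.log2(worst))
    return dims


# Toy usage: rows are users, columns are items (hypothetical data).
toy = [[1, 1, -1],
       [1, 1, -1],
       [1, 1, -1],
       [1, -1, -1]]
print(item_doubling_dimensions(toy, noise=0.0))
```

A histogram of the returned list plays the role of the per-item values summarized in the figures. A finer radius grid gives a closer approximation to the least admissible value.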


¹ [26]; data available at http://goldberg.berkeley.edu/jester-data/ 

² [27]; data available at http://grouplens.org/datasets/movielens/ 
Figure 1: Jester Doubling Dimensions 

Figure 2: MovieLens Doubling Dimensions 


E Chernoff Bound 


The following is a standard version of the Chernoff bound [28] that we use throughout the paper. 


Theorem E.1 (Chernoff Bound). Let $X_1, \dots, X_n$ be independent random variables that take values 
in $[0,1]$. Let $X = \sum_{i=1}^{n} X_i$ and $\bar{X} = \mathbb{E}[X]$. Then, for any $\varepsilon > 0$, 


$$P\big(X \ge (1 + \varepsilon)\bar{X}\big) \le \exp(\cdots), \quad \text{and} \quad P\big(X \le (1 - \varepsilon)\bar{X}\big) \le \exp(\cdots)$$